Figures
Abstract
The task of named entity recognition (NER) plays a crucial role in extracting cybersecurity-related information. Existing approaches for cybersecurity entity extraction predominantly rely on manual labelling data, resulting in labour-intensive processes due to the lack of a cybersecurity-specific corpus. In this paper, we propose an improved self-training-based distant label denoising method for cybersecurity entity extraction. Firstly, we create two domain dictionaries of cybersecurity. Then, an algorithm that combines reverse maximum matching and part-of-speech tagging restrictions is proposed, for generating distant labels for the cybersecurity domain corpus. Lastly, we propose a high-confidence text selection method and an improved self-training algorithm that incorporates a teacher-student model and weight update constraints, for exploring the true labels of low-confidence text using a model trained on high-confidence text, thereby reducing the noise in the distant annotation data. Experimental results demonstrate that the cybersecurity distantly-labelled data we obtained is of high quality. Additionally, the proposed constrained self-training algorithm effectively improves the F1 score of several state-of-the-art NER models on this dataset, yielding a 3.5% improvement for the Vendor class and a 3.35% improvement for the Product class.
Citation: Zhang K, Wang Y, Li O, Hao S, He J, Lan X, et al. (2024) Improved self-training-based distant label denoising method for cybersecurity entity extractions. PLoS ONE 19(12): e0315479. https://doi.org/10.1371/journal.pone.0315479
Editor: Mohammed Zakariah, King Saud University, SAUDI ARABIA
Received: May 8, 2024; Accepted: November 27, 2024; Published: December 17, 2024
Copyright: © 2024 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data are available from the Kaggle database nvd_data(https://www.kaggle.com/datasets/adurey/nvd-data)”.
Funding: This work was supported in part by the Joint Innovation Fund of Sichuan University and the Nuclear Power Institute of China (Grant No. HG2022143), as well as the Sichuan Province Science and Technology Plan Key Research and Development Project (Grant No. 2023YFG0294.
Competing interests: The authors have declared that no competing interests exist.
Introduction
An abundance of security data exists in the form of unstructured natural language text or semi-structured vulnerability databases. These databases include the National Vulnerability Database(NVD) [1], Common Platform Enumeration (CPE) [2], and Exploit Database (Exploit-DB) [3]. When researchers perform cybersecurity-related tasks, such as intrusion detection, threat intelligence base building, etc., they need to perform Named Entity Recognition (NER) on these large amounts of cybersecurity data for structured security data analytics [4–6]. Therefore, the field of NER has attracted significant attention from researchers.
Gasmi et al. [7] proposed the use of LSTM-CRF for extracting network security entities, achieving good results by leveraging the memory effect of LSTM. Huang et al. [8] first introduced BiLSTM-CRF for NER tasks, which is one of the most representative deep learning-based NER models. Chui et al. [9] proposed the use of an LSTM-CNN model for NER, which utilizes CNN instead of CRF as a post-processing layer and can extract character-level features. Ma et al. [10] combined BiLSTM, CNN, and CRF, introducing BiLSTM-CNN-CRF, which was one of the best-performing NER models before the emergence of pre-training models. Tikhomirov et al. [11] introduced BERT into the field of network security NER, achieving better results compared to other traditional models.
In summary, many researchers have conducted studies on NER, primarily focusing on improvements in model architecture. However, NER still faces challenges such as a lack of high-quality annotated data and high annotation costs, especially in domains like cybersecurity that require expert knowledge. On one hand, training NER models with good performance requires a large amount of high-quality annotated data for supervised training. On the other hand, obtaining high-quality annotated data is difficult, particularly in domains like cybersecurity that are highly specialized and lack publicly available datasets with high confidence levels. Notably, NER models trained on specific datasets often underperform when applied to different ones [12]. Therefore, transferring NER models from one domain to another is not an effective approach.
To address these problems, in this paper, we propose an improved self-training-based distant label denoising method for cybersecurity entity extraction. Firstly, we build two cybersecurity domain dictionaries as remote knowledge bases. Secondly, we propose an algorithm called Reverse Maximum Matching combined with POS condition (referred to as RMP) to generate distant labels, then we utilize a high-confidence text selection criterion to identify and select high-confidence texts, train an initial NER model on these texts. Lastly, we propose a framework called Improved Self-Training (referred to as IST), which incorporates the teacher-student model and weight update constraints. This framework iteratively predicts and explores the true labels of low-confidence texts, updating the weights. By doing so, it reduces the impact of noisy labels and gradually improves the performance of the model. The experimental results demonstrate the effectiveness of our method. The main contributions of this paper are as follows:
- We propose a matching algorithm that combines Reverse Maximum Matching (RMM) with POS conditions. This method differs from traditional forward dictionary matching approaches and achieves higher precision and recall rates in distant supervision.
- We propose an improved self-training algorithm that combines a teacher-student model with weight update condition constraints. This algorithm can effectively reduce the impact of noisy labels.
- We evaluate the quality of the obtained distant labels and the effectiveness of the improved self-training algorithm on NER models. The experimental results demonstrate that the cybersecurity distant data we obtained has good quality, and the constrained self-training algorithm can effectively improve the performance of several state-of-the-art NER models on this dataset.
Related work
Cybersecurity entity extraction is a use of NER tasks in the field of cybersecurity. As an important task of NLP, NER has always been given attention by researchers and many related algorithms have been proposed. Most previous studies mainly focused on a fully supervised approach, which requires a well-annotated corpus. These studies usually consider the neural network models to fit label distribution. Huang et al. [8] presented the BiLSTM-CRF model, which is one of the most representative deep learning-based NER models. They considered named entity recognition as a sequence annotation task. Ma et al. [10] presented BiLSTM-CNNS-CRF, which uses Convolutional Neural Networks(CNN) to extract features from the characters of words, splice them with GloVe word embeddings, and finally feed them into BiLSTM-CRF for training.
In order to reduce human labour and quickly label a large amount of data, distant supervision is used to automatically label training data with a domain-specific dictionary or public knowledge base. Mayhew et al. [13] proposed Constrained Binary Learning (CBL), a self-training algorithm that iteratively identifies true negative labels in NER to mitigate noise impact and improve the effectiveness of NER models.
Liang et al. [14] combined this technology with the pre-training model, and put forward the BOND model, which was applied to the long-distance annotated text obtained based on knowledge base matching, and used the way of early stopping to prevent the fitting noise. Meng et al. [15] presented the RoSTER model, which removes noise labels by introducing a noise robustness loss function based on generalized cross-entropy and makes the model more stable by adopting model integration and other methods.
Compared to other domains, entity extraction in cybersecurity is less studied and faces many problems. Gasmi et al. [7] proposed LSTM-CRF to extract entities in cybersecurity, and also performed feature engineering without expert knowledge. Satyapanich et al. [16] expanded the entity categories and utilized BERT and Word2Vec pre-trained word embeddings in combination with a BiLSTM model to extract entity information from network security articles. Georgescu et al. [17] designed an automated diagnostic system for the Internet of Things (IoT) cybersecurity environment. They used a semi-supervised method Watson Knowledge Studio tool to train the NER model for heterogeneous IoT data. Evangelatos et al. [18] employed fine-tuning of state-of-the-art transformer-based models for NER in Cyber Threat Intelligence. Alam et al. [19] proposed CyNER, a user-friendly Python library that provides access to transformer-based NER models pre-trained on high-fidelity security events along with a heuristic model for extracting threat indicators. Zhen et al. [20] proposed RoBERTa-wwm-RDCNN-CRF, an end-to-end NER model for Chinese Cyber Threat Intelligence that utilizes a robustly optimized pre-trained RoBERTa with whole word masking, leading to significant improvements in entity recognition performance. Hu et al. [21] introduced JCLB, an innovative model that integrates Contrastive Learning with a Belief Rule Base to enhance NER accuracy in the cybersecurity domain.
In summary, while numerous studies have focused on improving Named Entity Recognition (NER) through advancements in model architecture, significant challenges remain, especially in specialized domains like cybersecurity. The complexity and diversity of cybersecurity texts make it difficult for traditional NER methods to accurately extract security-related entities. High-performing NER models typically require large volumes of high-quality annotated data for supervised training, but the lack of a domain-specific corpus poses a significant obstacle. Moreover, the scarcity of publicly available, reliable datasets makes obtaining high-quality annotations even more challenging. As a result, NER models trained on one dataset often struggle to generalize effectively to others. Additionally, many existing models struggle with handling and detecting noise in the data. To address these issues, we propose an enhanced self-training-based distant label denoising method specifically tailored for cybersecurity entity extraction.
Proposed method
To address the issue of manual annotation dependency in cybersecurity NER, we propose an improved self-training-based distant label denoising method for cybersecurity entity extraction, which includes domain dictionary construction, distant label generation, high-confidence text selection, and an improved self-training approach. The overall framework of the method is illustrated in Fig 1.
Domain dictionary construction
This section elucidates the construction of domain dictionary. We use the NVD’s Description text as the target corpus, and the CPE list as the external knowledge base. As indicated in Fig 2, a CPE URI includes multiple fields. We choose “vendor” and “product” as the distant labels to be annotated for two reasons: (1) These two labels are important in the field of cybersecurity as they help in quickly identifying potential targets under threat. (2) These two labels are more commonly found in NVD texts and are easier to extract, whereas other labels may not always be present.
There are two types of CPE dictionaries that we can use: a list of CPE URIs provided by the NVD for related vulnerabilities, known as Known Affected Software Configurations, and a comprehensive official CPE dictionary. In the subsequent description, we will refer to the distant labels generated using the first dictionary as “Label 1” and the ones generated using the second dictionary as “Label 2”. It should be noted that Label 1 is more precise, while Label 2 provides a broader coverage of vulnerability.
Distant label generation
Distant labels in an unlabeled corpus are typically obtained by matching the entity with those in the external knowledge base. In this work, we employ the reverse maximum matching algorithm to identify entities from the sentences that exist in a pre-defined dictionary. After that, the identified entities undergo POS condition filtering to assign them the appropriate class labels. Our RMP algorithm consists of two main parts: reverse maximum matching and POS condition filtering.
Algorithm 1: RMM Algorithm.
Input: An unlabeled text S = {x1…xn};
external knowledge base D;
the maximum length of entity k.
Output: Distant label .
1 foreach i ∈ [1, …, n] do
2 //Initialize the distant labels as “O”;
3
4 end
5 i ← n;
6 while i > 1 do
7 jmax ← min{k, i − 1};
8 t ← 1;
9 for j ← jmax, …, 1 do
10 ent ← {xi−j…xi};
11 if ent ∈ D then
12 ;
13 break;
14: end
15 //update the position of i;
16 i ← i − t
17: end
Reverse maximum matching.
The Reverse Maximum Matching (RMM) algorithm are based on the greedy approach of maximum matching, where the analyzed string is matched with the entries in a dictionary according to certain strategies, the principle is to maximize the length of individual words or entities whenever possible. If a string is found in the dictionary, it is considered a successful match. For example, the algorithm can successfully match a specific entity like “windows_10” rather than a broader equivalent like “windows”, which can be more accurate. The algorithm is not only used for word segmentation but can also be applied to sequence labelling tasks [22–25]. The advantages of RMM include its fast speed and high accuracy, as it does not generate strange words. However, it may introduce noisy labels and struggle with handling out-of-vocabulary words. Due to its limited scalability, it is suitable for domain-specific NER tasks. Therefore, we have chosen to use them as our distant labelling approach.
Algorithm 2: RMP Algorithm.
Input: CPE list Lcpe;
text S = {x1…xn};
POS tag ;
the maximum length of entity k.
Output: Distant label .
1 //Initialize the dictionary;
2 Dven ← Set(), Dpro ← Set();
3 for l ∈ Lcpe do
4 Dven ← The string between the third colon and the fourth colon in l;
5 Dpro ← The string between the fourth colon and the fifth colon in l;
6 end
7 //Initialize the distant labels as “O”;
8 ;
9 while i ≤ n do
10 jmax ← min{k, i − 1}, t ← 1;
11 for j ← jmax, …, 1 do
12 hasNN ← False;
13 ent ← {xi−j…xi};
14 if {} then
15 hasNN ← True;
16 if ent ∈ Dven and hasNN = True then
17 ;
18 t ← j;
19 break;
20 if ent ∈ Dpro and hasNN = True then
21 ;
22 t ← j;
23 break;
24 end
25 //update the position of i;
26 i ← i − t
27 end
The overall flow of the RMM algorithm is presented in Algorithm 1. RMM algorithm has several advantages: (1) Higher efficiency: Since keywords and phrases in most languages tend to appear at the end of sentences, the RMM algorithm can quickly find matches, thereby improving search efficiency. (2) For longer target strings, the RMM algorithm is generally faster than the forward traversing. This is because the RMM algorithm starts matching from the end, allowing for quicker exclusion of non-matching parts and reducing the search scope.
POS condition filtering.
To improve the accuracy of matching, we have introduced POS as a filtering criterion to determine whether to assign labels based on the corresponding POS tags of the text. We have set two filtering conditions: (1) Only words tagged as noun POS will be assigned labels, specifically NN, NNS, NNP, and NNPS. Otherwise, the label will be set as “O”. (2) If a word is followed by a numeral POS, specifically CD, the word will be assigned the label “Product”.
The reason for setting the first condition is evident because both Vendor and Product types are nouns. If a word with a different POS is matched, it indicates that the word is not the desired type in the current context. The reason for setting the second condition is that sometimes Vendor and Product names can be the same, but if followed by a numeral, it generally indicates a product version in cybersecurity texts. Therefore, it is considered as a Product. With these two simple filtering conditions, we can obtain more accurate distant labels.
Overall algorithm.
We summarize the entire process of distant labelling as shown in Algorithm 2. Lines 1-8: Generating the remote dictionary knowledge base. Line 10: Initializing remote labels as “O”. Line 12: Truncating words starting from the maximum length. Lines 13-25: Performing distant labelling. Line 27: Updating the starting position.
Noise elimination through improved self-training algorithm
In section, we presented a distant labelling method that aids in building a cybersecurity corpus. Nevertheless, distant supervision can result in some data being improperly labelled, thus creating noise. To mitigate this issue, we introduced an improved self-training learning framework designed to negate the effects of such noisy data. In this section, we provide a detailed description of the improved self-training method we propose. Step 1: We train an initial NER model using high-confidence texts. Step 2: We use the NER model to predict pseudo-labels for the low-confidence texts. Step 3: Based on the predicted labels and constraint conditions, we update the weights. The updated weights are then used to select high-confidence texts for the next round of training. Step 4: We iterate this training process until a pre-defined threshold is reached.
High confidence text selection.
The first step of our method is to define high-confidence text, which will be used to train the initial NER model. Subsequent self-training iterations will be performed based on this initial model. We adhered to the following criteria while selecting high-confidence texts:
- Words labelled by Label 1 that are not classified as “O” are selected as it has high confidence.
- Words concurrently labelled as “O” by Label 1 and Label 2 are considered to have a high confidence label of “O”.
The logic behind the selection of high-confidence text is that Label 1 is more precise, indicating entities that are more likely to align with the true labels. On the other hand, Label 2 is more comprehensive, so if both Label 1 and Label 2 classify a token as a non-entity, there is a high probability that it is indeed a non-entity.
NER training.
In general, NER models are trained as sequence labelling models. Specifically, for a sequence S = [x1, …, xn] and its corresponding labels L = [y1, …, yn], a NER model with parameters θ outputs class probabilities after applying the softmax function. The most commonly used loss function in NER is cross-entropy (CE): (1) where represents the probability (i.e., the softmax outputs) that the output class predicted by the model for a token xi will be yj. The meanings of some formula symbols used in this paper are described in Table 1.
In order to adapt the NER model to the algorithm and data proposed in this paper, we expand Eq 1, and provide a weighted loss function demonstrated in Eq 2. Here, wi denotes the weight of the training data. If wi = 1, it indicates that the data instance is from high-confidence text and should be included in the loss computation during training. On the other hand, if wi = 0, it signifies that the data instance is from low-confidence text and should not contribute to the loss computation. (2)
Constraints formulas.
The basic idea of the IST algorithm is to first train an initial NER model on high-confidence text. Then, this model is used to generate soft-label predictions for the low-confidence text. After applying certain constraint conditions, some low-confidence text instances are selected and promoted to high-confidence text. This iterative training process continues until a stop condition is met. Here, the objective function for the constraint selection process is defined as follows: (3) where Du represents the set of low-confidence text instances, and ci takes values 0 or 1, indicating whether the i-th token will be updated to high-confidence. C represents the number of classes, pi,j represents the predicted probability of the i-th token in the j-th class, yi,j takes values 0 or 1, indicating whether the i-th token is classified as the j-th class. The objective of this function is to select the low-confidence text instances with the highest predicted probabilities, i.e., the ones with the highest confidence. (4) (5) (6)
The constraints of the IST algorithm are represented by three equations, as shown in Eqs 4, 5 and 6. In these equations, Cupdate represents the summation of all ci values, while L represents the total number of labels. Cupdate directly determines the number of data instances that participate in the next iteration. There are two ways to obtain Cupdate: one is to use a threshold value, as shown in Eq 7, where the count is calculated for values greater than t; the other is to use a random proportion, as shown in Eq 8. (7) (8) Where r is a value between 0% and 100%.
Eq 6 represents the entity ratio constraint. This constraint is based on the observation made by Augenstein et al. [26] in their research that well-annotated datasets typically have an entity ratio of around 0.09±0.05. Therefore, this constraint is used to limit the entity label ratio obtained during the training process. If the ratio exceeds the highest limit, the low-confidence labels with relatively lower probabilities are put aside and no longer considered as high-confidence labels.
Algorithm 3: IST Algorithm.
Input: Pretrained model ;
high-confidence text ;
low-confidence text ;
stopping condition Cstop;
weight update constraint Cupdate.
Output: Final student model f(⋅, θstu).
1 //Initialize weight and model;
2 The ’s text weight , the ’s text weight ;
3 for t1 ∈ [1, …, T1] do
4 Train on ;
5 end
6 //teacher-student framework;
7 , ;
8 while Does not meet the stopping condition Cstop do
9 //pseudo-label ;
10 Use f(⋅, θtea) to predict the label of ;
11 Update based on the constraint Cupdate, and use the hard label y as the label for the data corresponding to ;
12 for t2 ∈ [1, …, T2] do
13 Train on and ;
14 end
15 ;
16 end
Improved self-training algorithm.
Drawing inspiration from the CBL framework proposed by Mayhew et al. [13], we integrated a constraint method with a self-training framework based on the teacher-student model [27], enabling the iterative learning of the model parameter θ. The principal process is depicted in Algorithm 3. Lines 1-5: Initialize the weights for high-confidence and low-confidence text. Train the initial NER model using high-confidence text. Line 6: Initialize the teacher and student model parameters with the parameters of the initial model. Lines 10-11: Use the teacher model to predict soft labels for the low-confidence text. Update the weights based on the condition for weight update using the soft labels. Lines 12-14: Train the student model using the hard labels corresponding to the updated data. Line 15: Assign the trained student model parameters to the teacher and student models for the next iteration.
The training process continues iteratively until a stop condition is met. In this specific implementation, the stop condition is defined as the point where the number of occurrences of the two types of entities no longer significantly increases. At this point, the final trained student model is considered as the output model.
Experiment
Dataset
We randomly downloaded 934 CVE vulnerability information from NIST [1] and manually annotated them based on tokens in IO format for later evaluation. We divided the dataset into training and testing sets in an 8:2 ratio. The specific quantities of sentences, tokens, and entities are shown in Table 2.
Table 3 shows the partial annotation data for vulnerability number CVE-2021-1527. The first column is the natural language text that uses NLTK for word segmentation; the second column is the part-of-speech tag obtained using NLTK; the third column is the remote tag obtained using the current dictionary; the fourth column is the manually annotated real tag; the fifth column gives the vulnerability number to which the natural language text belongs.
Compared methods
We compared several state-of-the-art NER models in terms of their performance on distant supervision and full supervision. Full supervision involved training the models using a ground truth training set, while distant supervision involved comparing the performance of Label 1 and our proposed IST algorithm. All methods were evaluated on the test set.
- FMM: We compare the Forward Maximum Matching (FMM) algorithm [28] with the RMM algorithm. The FMM algorithm is similar to the RMM algorithm, but it scans the text from left to right. This method is commonly used for dictionary-based matching.
- BiLSTM-CRF: Classic NER model [8]. BiLSTM (Bi-directional LSTM) is an improvement over LSTM as it not only considers the historical information of words but also incorporates future information, resulting in better word vector representations. The CRF layer is to serve as a post-processing component, effectively constraining the model’s output to produce correct class labels.
- LSTM-CNN: [9] Instead of using CRF, it uses CNN as the post-processing layer. In this context, CNN is employed to extract character-level features for each word.
- BiLSTM-CNN-CRF: [10] One of the state-of-art NER models before the pre-trained language model, consisting of BiLSTM, CNN and CRF.
- RoBERTa: [29] RoBERTa (Robustly Optimized BERT Pretraining Approach) is an improved model based on BERT. It incorporates several modifications and enhancements to further enhance performance.
- RoBERTa-wwm-RDCNN-CRF: [20] This state-of-the-art NER model integrates a residual dilated convolutional neural network (RDCNN) with a conditional random field (CRF), enhancing entity recognition performance in Chinese Cyber Threat Intelligence.
Experiment settings and metrics
Experiment settings.
We set the output dimension of word2vec vectors to 100, BERT output vectors to 768, and RoBERTa output vectors to 768. The LSTM model utilizes 256 LSTM cells, while the BiLSTM model employs 128 BiLSTM cells. The CNN model consists of 5 parallel 1-dimensional CNN layers, each with 256 filters. Eq 5 utilizes the Cupdate determined by Eq 8, with r set to 0.2. The training rounds for high-confidence text T1 are set to 2, and the stopping condition for self-training Cstop is when there are no newly selected high-confidence texts satisfying the constraint condition. The ratio of our training set to test set was 8:2.
Metrics.
For each experiment, we evaluate the performance using precision, recall, and F1 score. A higher precision indicates that the model’s predictions are more reliable, while a higher recall means that the model can better capture the actual positive instances. The F1 score combines precision and recall, making it a useful metric for overall quality assessment. These metrics provide a comprehensive understanding of the model’s performance in different aspects.
Precision. Precision is the ratio of correctly predicted positive observations to the total predicted positives. It reflects the accuracy of the model when it predicts a positive class. The formula is as follows: where TP (True Positive) represents the number of true positives, and FP (False Positive) represents the number of false positives.
Recall. Recall is the ratio of correctly predicted positive observations to all observations in the actual class. It reflects the model’s ability to capture all positive samples. The formula is as follows: where FN (False Negative) represents the number of false negatives.
F1 score. The F1 score is the harmonic mean of precision and recall, used to provide a balanced evaluation of the model’s precision and recall capabilities. The formula is as follows:
The F1 score balances precision and recall, making it particularly useful for datasets with imbalanced classes.
Distant label construction experiment results
We conduct experimental evaluations on the distantly generated labels using the RMP algorithm to showcase the quality of our distant labels, as shown in Table 4.
Main results.
We compare RMP with FMP (Forward Maximum Matching algorithm with POS) in terms of constructing distant labels on Label 1 and Label 2. Table 4 presents the precision, recall, and F1 scores of both methods for two entity classes, along with the corresponding micro and macro metrics. On both labels, RMP outperformed FMP, indicating that RMP is indeed superior to FMP on this dataset. Specifically, (1) the overall performance of Label 1 is better than Label 2, particularly in terms of precision and F1 scores. However, in terms of recall, especially for the Product class, Label 2 exhibits higher recall than Label 1. This may be because the Product class has a broader scope, which Label 1 cannot fully cover, while Label 2, with its larger dictionary space, can encompass it effectively; (2) due to the larger dictionary space in Label 2, its precision significantly lags behind Label 1, introducing more noise. On the other hand, Label 1 maintains overall high quality. Therefore, high-confidence texts primarily rely on Label 1.
Ablation study.
In order to further validate the effectiveness of the proposed POS condition filtering, we conduct ablation experiments. “w/o POS” denotes the condition where POS filtering is not applied. The results are shown in Table 5. It can be observed that, for both Label 1 and Label 2, removing the POS condition filtering led to a certain decrease in label quality. While the recall slightly increased due to the absence of limiting conditions, the significant decrease in precision and F1 scores outweighs this minor increase. These results demonstrate the necessity of adding POS condition restrictions to improve the quality of distant labels.
Improved self-training experiment results
To validate the effectiveness of the proposed IST algorithm in mitigating the impact of noise in distant labels and improving the performance of NER models, we conduct experiments using various NER models that have been proven to perform well on NER tasks. We selected different embedding methods and network architectures for comparison.
Main results.
Table 6 presents the performance comparison of all methods in terms of precision, recall, and F1 score. From the table, it can be observed that the performance of all models trained on distant labels is significantly inferior to that of models trained under full supervision with true labels. This indicates that the distant labels contain a substantial amount of noise, leading to a decrease in model performance. However, our proposed IST algorithm has greatly improved the metrics for all models compared to training directly on Label 1. Specifically, (1) all models show improvements in precision, recall, and F1 score. Notably, the proposed constrained self-training algorithm significantly enhanced the F1 scores of several state-of-the-art NER models on this dataset, achieving a 3.5% improvement for the Vendor class and a 3.35% improvement for the Product class. Among them, BiLSTM-CRF exhibits the largest improvement in F1 score for the Vendor class, with an increase of 4.35%, while the BiLSTM-CNN-CRF model demonstrates the largest improvement of 3.95% in the Product class. (2) The best-performing models among the five are BiLSTM-CNN-CRF, RoBERTa, and RoBERTa-wwm-RDCNN-CRF with RoBERTa-wwm-RDCNN-CRF’s IST algorithm producing results closest to those obtained with true label training. The results indicate that pre-trained models, due to their better generalization capabilities, can fit the true distribution more effectively during self-training, resulting in overall better performance. The experimental results demonstrate that our proposed IST algorithm effectively reduces the noise impact of distant labels and improves the performance of the NER models.
Parameter study.
We explored the influence of two hyperparameters t and r on the experimental results, as shown in Eqs 7 and 8. Fig 3 presents the experimental results, where the blue curve represents the influence of hyperparameter r, and the red curve represents the influence of hyperparameter t. From the experimental analysis, we observed that the model is relatively insensitive to t, but the F1 peak achieved by it is lower than that of r. Additionally, r remains insensitive within a certain range to achieve the F1 peak. Therefore, for actual model training and application, we chose r to constrain the value of Cupdate, and the selected value for r is 0.2.
Conclusion and future work
In this study, we investigate the problem of distant supervision for NER in the field of cybersecurity. To obtain the distant labels, we first create two cybersecurity domain dictionaries, then we propose a method that combines reverse maximum matching and POS tag filtering to construct a distant dataset. To mitigate the noise introduced by distant labels and improve the performance of NER models, we introduce an improved self-training algorithm. This algorithm first trains the NER model on high-confidence texts and then updates the confidence weights through constraint conditions using a teacher-student model for self-training. Our approach effectively alleviates the noise impact of distant labels and significantly improves the performance of multiple state-of-the-art NER models on this distant dataset.
While our proposed method demonstrates significant potential, it is important to highlight some limitations and potential areas for future research. Firstly, we only utilized two cybersecurity lexicons for constructing the distant labels, which may limit the coverage of domain-specific terms. In future work, we could expand our approach by incorporating additional and more comprehensive cybersecurity dictionaries. Secondly, the IST algorithm proposed in this study relies heavily on the quality and quantity of annotated data. This can pose challenges in scenarios where labeled data is scarce. To address this limitation, future research could explore the integration of meta-learning and other low-resource domain training techniques with natural language processing. Such approaches could guide the selection of labeled data more effectively and tackle the issue of limited resources in small-sample scenarios.
References
- 1.
National Vulnerability Database;. https://nvd.nist.gov/vuln/data-feeds.
- 2.
Official Common Platform Enumeration (CPE) Dictionary;. https://nvd.nist.gov/products/cpe.
- 3.
The Exploit Database;. https://www.exploit-db.com.
- 4. Gao Y, Li X, Peng H, Fang B, Philip SY. Hincti: A cyber threat intelligence modeling and identification system based on heterogeneous information network. IEEE Transactions on Knowledge and Data Engineering. 2020;34(2):708–722.
- 5. Biswas B, Mukhopadhyay A, Bhattacharjee S, Kumar A, Delen D. A text-mining based cyber-risk assessment and mitigation framework for critical analysis of online hacker forums. Decision Support Systems. 2022;152:113651.
- 6. Sun N, Ding M, Jiang J, Xu W, Mo X, Tai Y, et al. Cyber threat intelligence mining for proactive cybersecurity defense: a survey and new perspectives. IEEE Communications Surveys & Tutorials. 2023;.
- 7. Gasmi H, Bouras A, Laval J. LSTM recurrent neural networks for cybersecurity named entity recognition. ICSEA. 2018;11:2018.
- 8. Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF models for sequence tagging. ArXiv. 2015;abs/1508.01991.
- 9. Chiu JP, Nichols E. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the association for computational linguistics. 2016;4:357–370.
- 10.
Ma X, Hovy E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics; 2016. p. 1064–1074. Available from: https://aclanthology.org/P16-1101.
- 11.
Tikhomirov M, Loukachevitch N, Sirotina A, Dobrov B. Using bert and augmentation in named entity recognition for cybersecurity domain. In: Natural Language Processing and Information Systems: 25th International Conference on Applications of Natural Language to Information Systems, NLDB 2020, Saarbrücken, Germany, June 24–26, 2020, Proceedings 25. Springer; 2020. p. 16–24.
- 12. Giorgi JM, Bader GD. Towards reliable named entity recognition in the biomedical domain. Bioinformatics. 2020;36(1):280–286. pmid:31218364
- 13.
Mayhew S, Chaturvedi S, Tsai CT, Roth D. Named entity recognition with partially annotated training data. arXiv preprint arXiv:190909270. 2019;.
- 14.
Liang C, Yu Y, Jiang H, Er S, Wang R, Zhao T, et al. Bond: Bert-assisted open-domain named entity recognition with distant supervision. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2020. p. 1054–1064.
- 15.
Meng Y, Zhang Y, Huang J, Wang X, Zhang Y, Ji H, et al. Distantly-supervised named entity recognition with noise-robust learning and language model augmented self-training. arXiv preprint arXiv:210905003. 2021;.
- 16.
Satyapanich T, Ferraro F, Finin T. Casie: Extracting cybersecurity event information from text. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34; 2020. p. 8749–8757.
- 17. Georgescu TM, Iancu B, Zurini M. Named-entity-recognition-based automated system for diagnosing cybersecurity situations in IoT networks. Sensors. 2019;19(15):3380. pmid:31374902
- 18.
Evangelatos P, Iliou C, Mavropoulos T, Apostolou K, Tsikrika T, Vrochidis S, et al. Named entity recognition in cyber threat intelligence using transformer-based models. In: 2021 IEEE International Conference on Cyber Security and Resilience (CSR). IEEE; 2021. p. 348–353.
- 19.
Alam MT, Bhusal D, Park Y, Rastogi N. Cyner: A python library for cybersecurity named entity recognition. arXiv preprint arXiv:220405754. 2022;.
- 20. Zhen Z, Gao J. Chinese Cyber Threat Intelligence Named Entity Recognition via RoBERTa-wwm-RDCNN-CRF. Computers, Materials & Continua. 2023;77(1).
- 21. Hu C, Wu T, Liu C, Chang C. Joint contrastive learning and belief rule base for named entity recognition in cybersecurity. Cybersecurity. 2024;7(1):19.
- 22. Li H, Chen PH. Improved backtracking-forward algorithm for maximum matching Chinese word segmentation. Applied Mechanics and Materials. 2014;536:403–406.
- 23. Yan X, Xiong X, Cheng X, Huang Y, Zhu H, Hu F. HMM-BiMM: Hidden Markov Model-based word segmentation via improved Bi-directional Maximal Matching algorithm. Computers & Electrical Engineering. 2021;94:107354.
- 24. Yang L, Suge W, et al. Name entity recognition in legal instruments based on matching strategy and community attention mechanism. Journal of Chinese Information Processing. 2022;36(2):85–92.
- 25. Jiang L, Shi J, Pan Z, Wang C, Mulatibieke N. A multiscale modelling approach to support knowledge representation of building codes. Buildings. 2022;12(10):1638.
- 26. Augenstein I, Derczynski L, Bontcheva K. Generalisation in named entity recognition: A quantitative analysis. Computer Speech & Language. 2017;44:61–83.
- 27.
Xie Q, Luong MT, Hovy E, Le QV. Self-training with noisy student improves imagenet classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 10687–10698.
- 28.
Wong Pk, Chan C. Chinese word segmentation based on maximum matching and word binding force. In: COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics; 1996.
- 29. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv. 2019;abs/1907.11692.