
Domain generation algorithms detection with feature extraction and Domain Center construction

  • Xinjie Sun ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    sxj123lps@163.com

    Affiliations Institute of Computer Science, Liupanshui Normal University, Liupanshui, Guizhou, China, Guizhou Xinjie Qianxun Software Service Co., Ltd, Liupanshui, Guizhou, China

  • Zhifang Liu

    Roles Investigation, Software, Visualization, Writing – original draft

    Affiliation Institute of Computer Science, Liupanshui Normal University, Liupanshui, Guizhou, China

Abstract

Network attacks using Command and Control (C&C) servers have increased significantly. To hide their C&C servers, attackers often use Domain Generation Algorithms (DGA), which automatically generate domain names for C&C servers. Researchers have constructed many unique feature sets and detected DGA domains through machine learning or deep learning models. However, because a domain name by itself contains limited information, detection results based on it alone are limited. To overcome this problem, domain name features, Whois features and N-gram features are extracted for DGA detection. To obtain the N-gram features, domain name whitelist and blacklist substring feature sets are constructed. In addition, a deep learning model based on BiLSTM, Attention and CNN is constructed, and the Domain Center is built for fast classification of domain names. Multiple comparative experiments show that the proposed model not only achieves the best Accuracy, Precision, Recall and F1, but also greatly reduces detection time.

1 Introduction

Malware has developed into the number one public enemy threatening network security. To evade detection by security facilities, its production process is becoming more and more complex. One typical approach is to integrate a Domain Generation Algorithm (DGA) [1] into the malware to generate a large number of rapidly changing domain names. As a backup or primary means of communication with the Command and Control (C&C) server, this method effectively increases the robustness of a botnet [2], allowing it to continuously control infected hosts. Correspondingly, research on DGA algorithms has become a hot topic in network security. However, because DGA domains change rapidly, existing methods produce too many false positives in practical use. Therefore, the detection of DGA domains remains an arduous task in the computer security field.

Discovering DGA domains is very important for maintaining network security. Existing solutions mainly include static blacklists [3], reverse engineering [4], machine learning [5] and deep learning [6]. Because static blacklists update slowly while DGA domains update quickly, blacklists are difficult to apply effectively to DGA detection. Reverse engineering requires a malware sample, which is not always available. In recent years, machine learning and deep learning methods have provided new hope for DGA detection.

Machine learning and deep learning methods construct a feature set of domain names and combine machine learning or deep learning models to detect DGA domains. Since deep learning models have stronger nonlinear modeling ability than machine learning models, they can detect DGA domains more accurately. Although there have been some studies on detecting DGA domains through deep learning models [7–9], most of them construct the feature set from the domain name only. Because the domain name contains limited effective information, DGA domains cannot be accurately detected. To overcome this shortcoming, a feature set with rich features is constructed. Considering that the Whois [10] record of a domain name contains rich features related to the domain name category (e.g. registrar, registration time), the Whois features are extracted for the feature set. Considering that the N-gram [11] features of a domain name also contain rich information reflecting whether the domain name is malicious, blacklist and whitelist substring datasets are constructed to obtain the N-gram features.

Due to its great ability in sequential modeling, the Recurrent Neural Network (RNN) [12] is the most widely used type of deep learning model in DGA detection. Although RNN has achieved some success in DGA detection, it still has defects. Firstly, when the sequence is too long, the vanishing gradient problem inevitably occurs. Secondly, RNN assigns the same weight to all features. Finally, the high dimensionality of RNN makes the model difficult to converge. For deep neural networks, it has been shown that more complex network structures tend to perform better [13]. Therefore, Bi-Directional Long Short-Term Memory (BiLSTM), Attention and a Convolutional Neural Network (CNN) are used to construct the DGA detection model. BiLSTM [14] mitigates the vanishing gradient problem, the Attention mechanism [15] assigns different weights to different features, and the CNN [16] reduces dimensionality. In addition, skip connect [17] is used at the output of the Attention network to alleviate the vanishing gradient and weight matrix degradation problems.

Although the classification results of a deep learning model improve as the number of layers increases, the time spent also increases [13]. Therefore, to reduce DGA detection time on the validation set, the Domain Center is constructed. When a new domain name is input, its feature vector is first obtained, and its hidden vector is then computed by the constructed deep learning model. Finally, the Euclidean distances [18] between the hidden vector and the mean vectors stored in the Domain Center are calculated to obtain the final classification result.

Our main contributions are as follows:

  1. A feature set including the domain name features, the Whois features and the N-gram features is constructed. To obtain the N-gram features, the domain name whitelist and blacklist substring feature sets are constructed.
  2. A deep learning model based on BiLSTM, Attention, skip connect and CNN is constructed.
  3. The Domain Center is proposed to reduce the DGA detection time on the validation set.

The remainder of this work is organized as follows. Section 2 introduces the latest research results on DGA detection; Section 3 introduces the background of BiLSTM, the Attention mechanism and CNN; Section 4 introduces the construction of the feature set; Section 5 introduces the structure of the deep learning model constructed in this paper; Section 6 introduces the selected dataset and the experimental results; Section 7 provides the final conclusion.

2 Related work

DGA detection methods include blacklists, reverse engineering, machine learning and deep learning. Although the blacklist method provides effective security and is used by most network security companies [19, 20], its inherently slow update speed makes it easy for DGA domains to bypass blacklist detection. Reverse engineering requires a malware sample, which is not always feasible for DGA detection [21]. Therefore, most research on DGA detection focuses on machine learning and deep learning methods.

Machine learning methods first construct a DGA domain feature set and then detect DGA domains with machine learning models. Tuan et al. [22] proposed a machine learning based DGA detection model using TF-IDF and n-grams for feature representation; the results showed that logistic regression and SVM were the most effective. Štampar et al. [23] engineered a robust feature set and accordingly trained and evaluated 14 ML, 9 DL and 2 comparative models on two independent datasets; the experiments showed that if ML features are properly engineered, there is only a marginal difference in overall score between the top ML and DL representatives. Soleymani et al. [24] applied machine learning algorithms and text mining technology to analyze the DNS protocol and identify DGA botnets; their experiments showed that Random Forest could be used effectively in DGA botnet detection and had the best detection accuracy. Chin et al. [25] proposed a machine learning framework for identifying and clustering domain names to circumvent threats from DGAs. Li et al. [26] proposed a machine learning framework (a two-level model and a prediction model) for identifying and detecting DGA domains to alleviate their threat. Baruch et al. [27] surveyed different machine learning methods for detecting DGAs by analyzing only the alphanumeric characteristics of the domain names in the network.

Deep learning methods also construct a feature set of DGA domains and then detect them through deep learning models. Tuan et al. [28] proposed solutions for detecting and classifying DGA families: two deep learning models called LA_Bin07 and LA_Mul07 that combine an LSTM network and an Attention layer. The experiments showed that LA_Bin07 and LA_Mul07 solved binary and multiclass DGA botnet classification with very high accuracy. Namgung et al. [29] proposed an efficient DGA detection method based on BiLSTM, which further maximized detection performance with an integrated CNN + BiLSTM model that learns local and global information in the domain sequence at the same time. In their experiments, existing CNN and LSTM models obtained F1 scores of 0.9384 and 0.9597 respectively, while the proposed BiLSTM and integrated models obtained F1 scores of 0.9618 and 0.9666 respectively. Liang et al. [30] proposed three feature extraction methods adapted to the length of DGA domains and further analyzed the public suffix to evaluate its impact on detection; their experiments showed that the method greatly improved detection performance. Lison et al. [31] demonstrated that a deep learning approach based on RNN was able to detect domain names generated by DGAs with high precision. Xu et al. [32] combined n-grams and a deep CNN to propose a novel n-gram combined character based domain classification (N-CBDC) model; experiments on real-world data showed that N-CBDC could effectively detect DGA domains. Ren et al. [33] proposed a deep learning framework for identifying and detecting DGA domains.

It can be seen from previous research that the primary task of both machine learning and deep learning methods is to construct the DGA domain feature set, which plays a key role in DGA detection. Considering that existing research basically constructs the feature set from the domain name only, which contains limited features, domain name features, Whois features and N-gram features are combined here to construct a feature set with rich features.

3 Background

3.1 LSTM

LSTM [34] is a variation of RNN [12], and the structure of LSTM is shown in Fig 1. LSTM includes an input gate i_t, a forget gate f_t and an output gate o_t. The forget gate f_t accepts the output of the previous unit module C_{t−1} and decides which part to keep and which to forget, which is calculated as follows [34]:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)  (1)

where x_t is the current input, σ(·) is the element-wise sigmoid function, W_f and U_f are weight matrices and b_f is the bias term. The input gate i_t determines which information is recorded into the cell state, and the cell state C_t is obtained by merging i_t and the new memory C̃_t [34]:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)  (2)

C̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)  (3)

C_t = f_t ⊗ C_{t−1} + i_t ⊗ C̃_t  (4)

where W_i, W_c, U_i and U_c are weight matrices, b_i and b_c are bias terms, ⊗ is element-wise multiplication, and tanh(·) is the hyperbolic tangent activation function. The output gate o_t determines the output value based on the cell state C_t. A sigmoid function first determines which part of C_t is to be output, C_t is then processed through the tanh(·) layer, and finally o_t and tanh(C_t) are multiplied to obtain the final desired output [34]:

o_t = σ(W_o x_t + U_o h_{t−1} + b_o)  (5)

h_t = o_t ⊗ tanh(C_t)  (6)

where W_o and U_o are weight matrices and b_o is the bias term.
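For concreteness, the following is a minimal NumPy sketch of a single LSTM time step implementing Eqs (1)–(6); the dimensions and random initialization are illustrative assumptions, not part of the model in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following Eqs (1)-(6). W, U, b hold the weight
    matrices / bias vectors of the forget (f), input (i), candidate (c)
    and output (o) parts."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # Eq (1)
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # Eq (2)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # Eq (3)
    c_t = f_t * c_prev + i_t * c_tilde                          # Eq (4)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # Eq (5)
    h_t = o_t * np.tanh(c_t)                                    # Eq (6)
    return h_t, c_t

# Toy usage with input size 3 and hidden size 4 (illustrative only).
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 3)) for k in "fico"}
U = {k: rng.normal(size=(4, 4)) for k in "fico"}
b = {k: np.zeros(4) for k in "fico"}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, U, b)
```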

3.2 Attention mechanism

As the length of input sentences increases, the ability of LSTM to remember connections between words that are far apart decreases. The Attention mechanism [15] solves this problem by considering all input words to create a context vector and assigning relative weights to them. The structure of the Attention mechanism is shown in Fig 2, where x = (x_1, x_2, ⋯, x_T) represents the input of the LSTM and h = (h_1, h_2, ⋯, h_T) represents the output of the LSTM hidden layer. The correlation e_{tj} between the jth input h_j and the current hidden state s_{t−1} is calculated as follows [15]:

e_{tj} = score(s_{t−1}, h_j)  (7)

where score(·) is a correlation operator, and the weighted dot product is chosen in this paper. A softmax transformation is performed on e_{tj} to obtain the corresponding probability a_{tj} [15]:

a_{tj} = exp(e_{tj}) / Σ_{k=1}^{T} exp(e_{tk})  (8)

The context vector c_i of time step i is obtained as the weighted sum over the a_{ij} [15]:

c_i = Σ_{j=1}^{T} a_{ij} h_j  (9)
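A minimal NumPy sketch of Eqs (7)–(9) follows, assuming the weighted dot-product score is the bilinear form h_j^T W s_{t−1} (the exact parameterization is an assumption):

```python
import numpy as np

def attention_context(s_prev, H, W_score):
    """Context vector via Eqs (7)-(9).
    s_prev: previous hidden state (d,); H: hidden outputs (T, d);
    W_score: (d, d) score weight matrix (an illustrative assumption)."""
    e = H @ (W_score @ s_prev)               # Eq (7): e_j = h_j^T W s_{t-1}
    a = np.exp(e - e.max())                  # Eq (8): numerically stable
    a /= a.sum()                             #         softmax weights
    return a @ H                             # Eq (9): weighted sum of h_j

H = np.random.default_rng(1).normal(size=(5, 4))   # T = 5 steps, d = 4
c = attention_context(H[-1], H, np.eye(4))
```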

3.3 CNN

CNNs [16] capture local information and reduce dimensionality through one-dimensional (1D) convolution and pooling operations. The structure of a CNN for text classification is shown in Fig 3. In the convolution layer, features are extracted with the help of various filters. Between the convolution layer and the pooling layer, a rectified linear unit (ReLU) activation makes the features nonlinear. In the pooling layer, the feature maps are reduced in size, which reduces the computational effort of subsequent layers and highlights important features more efficiently.

4 Construction of feature sets

4.1 Domain name feature set

As shown in Fig 4, a complete domain name includes the Top-Level Domain (TLD), the Second-Level Domain (SLD), and so on, which can reflect the category or attribution of the website or the specific purpose of the domain name. Therefore, the domain name features shown in Table 1 are obtained.

4.2 Whois feature set

The Whois domain name database stores the information of all registered domain names. Whois information can be used to check the availability of domain names, identify trademark infringement, and hold domain name registrants accountable. Rich information such as registrant, registration time, DNS and so on can be obtained through Whois. As shown in Table 2, the Whois features are obtained and encoded as integers.
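As an illustration of this extraction, here is a sketch using the third-party python-whois package; the selected fields and their integer encoding are assumptions, since the exact Table 2 feature list is not reproduced here.

```python
import whois  # third-party "python-whois" package (pip install python-whois)

def whois_int_features(domain):
    """Fetch a Whois record and encode a few illustrative fields as ints
    (the actual Table 2 feature list may differ)."""
    try:
        w = whois.whois(domain)
    except Exception:
        # Unresolvable / unregistered domains get all-zero features.
        return {"has_record": 0, "has_registrar": 0, "reg_year": 0}
    created = w.creation_date
    if isinstance(created, list):   # some registries return several dates
        created = created[0]
    return {
        "has_record": 1,
        "has_registrar": int(bool(w.registrar)),
        "reg_year": created.year if created else 0,
    }

features = whois_int_features("example.com")
```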

4.3 N-gram feature set

4.3.1 Domain name whitelist substring N-gram feature set.

The Alexa ranking represents the worldwide popularity ranking of websites. The Alexa top 1 million lists the top 1 million websites in order of popularity, so the top-ranked websites in it usually have higher credibility. Therefore, the top 100,000 domain names of the Alexa top 1 million are selected to build the domain name whitelist substring feature set (DNWSFS), and the number of substrings of each domain name to be tested that appear in the DNWSFS is counted. The specific process is as follows.

Step one, remove the special characters of the 100,000 domain names from the Alexa top 1 million, and split them into substrings through the N-gram method. N-gram slides a window of length N from the left to the right of the domain name. Taking “domainname” as an example, the process of 4-gram is shown in Fig 5. According to empirical values, N is set to 3–8, and the 3-gram to 8-gram substrings of the 100,000 domain names are obtained.

Step two, count the number of occurrences of each substring in step one, so as to build the DNWSFS. DNWSFS contains all the substrings and the number of times they occur.

Step three, remove the special characters of each domain name to be tested, and obtain its 3-gram to 8-gram substrings.

Step four, query the number of occurrences in DNWSFS of the 3-gram to 8-gram substrings of the domain name to be tested.

According to the above process, the domain name whitelist substring N-gram features are obtained as shown in Table 3.

Table 3. Obtained domain name whitelist substring N-gram features.

https://doi.org/10.1371/journal.pone.0279866.t003
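The four steps can be sketched in a few lines of Python. The file name, the character-cleaning rule and the aggregation of substring counts into one feature per n are illustrative assumptions:

```python
from collections import Counter
import re

def ngrams(s, n):
    """All length-n sliding-window substrings of s (the N-gram split of Fig 5)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def clean(domain):
    """Step one/three: remove special characters (dots, hyphens, etc.)."""
    return re.sub(r"[^a-z0-9]", "", domain.lower())

# Steps one and two: build the DNWSFS from the whitelist domains
# ("alexa_top100k.txt" is a hypothetical one-domain-per-line file).
dnwsfs = Counter()
with open("alexa_top100k.txt") as fh:
    for line in fh:
        s = clean(line.strip())
        for n in range(3, 9):                 # 3-gram to 8-gram
            dnwsfs.update(ngrams(s, n))

# Steps three and four: count how often the substrings of a test domain
# occur in the DNWSFS, one count per n (the features of Table 3).
def whitelist_ngram_features(domain):
    s = clean(domain)
    return {n: sum(dnwsfs[g] for g in ngrams(s, n)) for n in range(3, 9)}
```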

4.3.2 Domain name blacklist substring N-gram feature set.

The 360netlab dataset stores DGA domains screened in real time by researchers at the 360 security company and contains domain names of 58 DGA families. In the same way as the DNWSFS, 100,000 domain names from 360netlab are selected to build the domain name blacklist substring feature set (DNBSFS), and the domain name blacklist substring N-gram features shown in Table 4 are obtained.

Table 4. Obtained domain name blacklist substring N-gram features.

https://doi.org/10.1371/journal.pone.0279866.t004

5 Method

The publicly available data selected in this article can be publicly accessed by everyone. The Alexa top domain names can be downloaded from https://www.alexa.com/, and the DGA domain names can be downloaded from https://data.netlab.360.com/.

A DGA detection model based on feature extraction and Domain Center construction, FEDCC, is constructed in this paper, and its structure is shown in Fig 6. Firstly, as described in Section 4, the domain name features, Whois features and N-gram features of the input layer are extracted to form the feature vector of a domain name. Secondly, the feature vector is input into the BiLSTM network to obtain the hidden vector. Thirdly, the Attention network assigns different degrees of attention to the hidden vector. Fourthly, the feature vector and the hidden vector output by the Attention network are added as the result of the skip connect network. Fifthly, the result of the skip connect network is input into the CNN network, where 1D convolution further extracts hidden relationships and max pooling reduces dimensionality. Sixthly, the output of the CNN network is input into the fully connected network, and the final classification result is obtained through the softmax function. Finally, the hidden vectors of all samples produced by the CNN network are input to the Domain Center, and the mean vectors of the different sample categories are obtained. When a new domain name is input, only the Euclidean distances between its hidden vector and the mean vectors stored in the Domain Center need to be calculated to classify it quickly. The specific process is as follows.

5.1 BiLSTM network

Unlike LSTM [34], which can only use the information before the current time node, BiLSTM [14] can use forward and backward temporal characteristics at the same time. As described in Section 4, the feature vector V = {v_1, v_2, ⋯, v_n} with length n is obtained. Then, V is input into the forward and backward networks of BiLSTM to obtain the forward hidden vector h→ and the backward hidden vector h←:

h→ = LSTM_forward(V)  (10)

h← = LSTM_backward(V)  (11)

Then, the forward and backward hidden vectors are concatenated to obtain the bidirectional hidden vector h:

h = [h→; h←]  (12)

5.2 Attention network

The Attention mechanism assigns different attentions to different features. Firstly, h is input into the Attention network to obtain the hidden representation x:

x = tanh(W h + b)  (13)

where W represents the weight matrix and b represents the bias term. Secondly, the attention scores of the features are calculated from the similarity between y and x, where y is a randomly initialized feature vector. After the scores are obtained, the softmax function is used for normalization to obtain the weight vector r:

r = softmax(y^T x)  (14)

Finally, the vector f summarizing the attention over all features is obtained through the weighted sum of r:

f = Σ_i r_i h_i  (15)

5.3 Skip connect network

As the number of network layers deepens, the objective function becomes more and more likely to fall into local optima, while the problems of weight matrix degradation and vanishing gradients become more serious. Since the skip connect network [35] directly takes the input data as part of the output, it alleviates these problems well. V and f are added to obtain the skip connect result V_sk:

V_sk = V + f  (16)

5.4 CNN network

The CNN network includes 1D convolution and max pooling. First, V_sk is input into the 1D convolution network to further extract hidden relationships in the hidden vector:

t = W_cnn ∗ V_sk + b_cnn  (17)

where W_cnn represents the convolution kernel weight, ∗ denotes the 1D convolution operation and b_cnn represents the bias term. Then, t is input into the max pooling network to obtain the pooling result g:

g = maxpool(t)  (18)

5.5 Output

g is input into the fully connected network, and the final classification label ŷ is obtained through the softmax function:

ŷ = softmax(W_fc g + b_fc)  (19)

where W_fc is the weight matrix of the fully connected layer and b_fc is its bias term.
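To make the data flow of Sections 5.1–5.5 concrete, here is a compact PyTorch sketch. It is a sketch under assumptions, not the authors' implementation: the 35-dimensional feature vector is treated as a length-35 sequence, the attention keeps its per-position weighting so the skip connect shapes match Eq (16), and only one of the three convolution layers of Section 6.3 is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEDCCNet(nn.Module):
    """Sketch of the BiLSTM + Attention + skip connect + CNN model
    (Sections 5.1-5.5). Feature length 35 (Section 6.6.4); layer sizes
    follow Section 6.3 where stated, otherwise they are assumptions."""
    def __init__(self, feat_len=35, n_classes=2):
        super().__init__()
        # 5.1: two stacked BiLSTM layers (128 and 256 units, Section 6.3).
        self.lstm1 = nn.LSTM(1, 128, bidirectional=True, batch_first=True)
        self.lstm2 = nn.LSTM(256, 256, bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(0.3)
        # 5.2: hidden projection W, b and a learned query vector y.
        self.att_proj = nn.Linear(512, 512)
        self.att_query = nn.Parameter(torch.randn(512))
        # 5.3/5.4: project raw features so the skip connect can add them.
        self.skip_proj = nn.Linear(1, 512)
        self.conv = nn.Conv1d(512, 128, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(3)
        self.fc = nn.Linear(128 * (feat_len // 3), n_classes)

    def forward(self, v):                            # v: (batch, 35)
        x = v.unsqueeze(-1)                          # features as a sequence
        h, _ = self.lstm1(x)
        h, _ = self.lstm2(self.drop(h))              # (batch, 35, 512)
        e = torch.tanh(self.att_proj(h)) @ self.att_query   # Eqs (13)-(14)
        r = F.softmax(e, dim=1).unsqueeze(-1)
        f = r * h                                    # per-position weighting
        v_sk = self.skip_proj(x) + f                 # Eq (16): skip connect
        t = F.relu(self.conv(v_sk.transpose(1, 2)))  # Eq (17)
        g = self.pool(t).flatten(1)                  # Eq (18)
        # Eq (19); for training, feed logits to CrossEntropyLoss instead.
        return F.softmax(self.fc(g), dim=1)

model = FEDCCNet()
probs = model(torch.randn(4, 35))                    # 4 toy domains
```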

5.6 Domain Center

Since there are a large number of DGA and benign domain names, when a new domain name is input, determining whether it is a DGA domain would ordinarily require calculating the distances between its feature vector and the feature vectors of all domain names in the training set. Therefore, to reduce DGA detection time on the validation set, the Domain Center is constructed. The Domain Center stores, for each category, the mean of the hidden vectors obtained by the deep learning model for all samples of that category. When a new domain name is input, only the Euclidean distances [18] between its hidden vector and the mean vectors stored in the Domain Center need to be computed, realizing fast domain name classification. The original and improved DGA detection methods are shown in Figs 7 and 8, respectively.

The processes of the original and Domain Center methods are as follows. First, all samples in the training set X = {X_1, X_2, ⋯, X_l} are stored, where l denotes the number of samples, X_i = {x_{i1}, x_{i2}, ⋯, x_{in}} and n is the length of the feature dimension. Suppose that the feature vector of the domain name to be detected is Y = {y_1, y_2, ⋯, y_n}. The original DGA detection method computes the distance to every training sample,

d_i = sqrt( Σ_{j=1}^{n} (y_j − x_{ij})² ), i = 1, 2, ⋯, l  (20)

and assigns Y the label of the nearest sample.

It can be seen that the time complexity of traditional DGA detection is O(l × n). The Domain Center is proposed to reduce the detection time, and its process is as follows.

First, the benign mean vector B and the DGA mean vectors D_1–D_c stored in the Domain Center are calculated, where c denotes the number of categories of DGA domains:

B = (1 / l_b) Σ_{X_label = 0} X_i,  D_i = (1 / l_i) Σ_{X_label = i} X_j  (21)

where l_b denotes the number of benign domains, X_label = 0 denotes the benign domains, l_i denotes the number of DGA domains labeled i, and i = 1, 2, ⋯, c indicates the class of the DGA. The improved similarity is calculated as follows:

d_k = sqrt( Σ_{j=1}^{n} (y_j − m_{kj})² ), m_k ∈ {B, D_1, ⋯, D_c}  (22)

and Y is assigned the label of the nearest mean vector.

It can be seen that the time complexity of the Domain Center is O(c + 1), which greatly reduces the time of domain name classification.
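A minimal NumPy sketch of the Domain Center follows, assuming the hidden vectors have already been produced by the deep model; sizes and labels are toy values.

```python
import numpy as np

def build_domain_center(H, labels):
    """Mean hidden vector per class (Eq 21). H: (l, n) hidden vectors from
    the deep model; labels: (l,) ints, 0 = benign, 1..c = DGA family."""
    return {k: H[labels == k].mean(axis=0) for k in np.unique(labels)}

def classify(h_new, center):
    """Nearest-centroid rule (Eq 22): O(c + 1) distance computations
    instead of one per training sample."""
    keys = list(center)
    d = [np.linalg.norm(h_new - center[k]) for k in keys]   # Euclidean
    return keys[int(np.argmin(d))]

# Toy usage with random hidden vectors (illustrative only).
rng = np.random.default_rng(2)
H = rng.normal(size=(600, 64))
labels = rng.integers(0, 3, size=600)    # 0 = benign, 1-2 = DGA families
center = build_domain_center(H, labels)
pred = classify(rng.normal(size=64), center)
```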

6 Experiment

6.1 Dataset

DGA domains: 58,000 domain names are selected from 360netlab, excluding the 100,000 domain names used to build the DNBSFS. The dataset covers all 58 DGA families, such as tordwm, dircrypt and fobber, and each DGA family contributes 1,000 domain names.

Benign domain names: 58,000 domain names are selected from the Alexa top 1 million as benign domain names, excluding the 100,000 domain names used to build the DNWSFS.

6.2 Baseline methods

WSDL [7]: This model proposes a set of heuristic algorithms that automatically label the domain names monitored in real traffic through a weakly supervised deep learning algorithm.

HDNN [8]: This model adopts an improved parallel CNN architecture with multi-scale convolution kernels to extract multi-scale local features from domain names. The framework also includes a BiLSTM architecture based on self-attention, which can extract bi-directional global features with Attention mechanism from domain names.

DNSML [36]: This model selects five characteristics of domain names and uses Random Forest to detect DGA domains.

DBD [37]: This model obtains the implicitly extracted statistical features and classifies domain names through deep learning architecture.

CNN-BiLSTM [29]: This model further maximizes detection performance by using the CNN + BiLSTM integrated model, allowing the model to learn local and global information in the domain sequence at the same time.

DGA-RNN [31]: This model constructs the feature set of domain names and uses simple RNN to detect DGA domains.

N-CBDC [32]: This model combines N-grams and a deep CNN for DGA classification.

ATT-CNN-BiLSTM [33]: This model first uses CNN and BiLSTM to extract the hidden feature of the feature vector, and then uses the Attention network to assign different weights to the hidden feature.

LA-BM07 [28]: This model constructs two deep learning models by combining the LSTM network and Attention layer. The model can judge whether a domain name is benign or malicious, and can identify the DGA families of malicious domain names.

HAGD [30]: This model constructs three feature extraction methods adapted to the length of the domain name. The model adopts different detection methods for different lengths of domain names.

The above baseline methods are reproduced according to the references, and the optimal results of each baseline method are obtained through parameter tuning.

6.3 Experimental environment

The model contains two BiLSTM layers and three CNN layers. The two BiLSTM layers have 128 and 256 units, respectively, and a dropout of 0.3 is used at the end of the second layer. Each CNN layer consists of a convolutional layer and a max pooling layer, and a dropout of 0.3 is used after each pooling layer. The three convolution kernels are all of size 3 × 3, with 128, 64 and 64 kernels respectively, and the pooling templates are all of size 3. The epoch is 150, the batch size is 32, and the learning rate is 0.00001.

6.4 Evaluation criteria

Accuracy, Precision, Recall and F1 are used as evaluation criteria, calculated as follows:

Accuracy = (TP + TN) / (P + N),  Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 × Precision × Recall / (Precision + Recall)  (23)

where P and N respectively represent the number of DGA and benign domains, TP and TN respectively represent the number of correctly predicted DGA and benign domains, and FP and FN respectively represent the number of incorrectly predicted DGA and benign domains.
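For concreteness, Eq (23) as a small Python helper; the counts follow the definitions above (P = TP + FN, N = TN + FP), and the example values are made up.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall and F1 per Eq (23)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # (TP + TN) / (P + N)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy usage with illustrative counts.
print(classification_metrics(tp=950, tn=940, fp=60, fn=50))
```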

6.5 Experimental results

FEDCC and the baseline methods are applied to the DGA and benign domain name datasets, and the DGA detection results are shown in Table 5. It can be seen from Table 5 that FEDCC obtains the best classification Accuracy, Precision, Recall and F1, which are 0.9713, 0.9627, 0.9765 and 0.9696, respectively. Among the baseline methods, the Accuracy, Precision, Recall and F1 of HAGD, LA-BM07, ATT-CNN-BiLSTM and HDNN are greater than 0.9, those of CNN-BiLSTM, DGA-RNN and N-CBDC are between 0.8 and 0.9, and those of the remaining methods are less than 0.8. DNSML obtains the worst Accuracy, Precision, Recall and F1, which are 0.2394, 0.2368, 0.2893 and 0.2636 lower than FEDCC, respectively. HAGD obtains the best baseline Accuracy, Precision, Recall and F1, which are 0.0191, 0.0166, 0.0264 and 0.0215 lower than FEDCC, respectively. In terms of classification time, FEDCC is fastest at 1.3 s. Among the baselines, ATT-CNN-BiLSTM takes the longest, 113.2 s, 87 times that of FEDCC, while DNSML takes the shortest, 15.7 s, 12 times that of FEDCC.

Although most baseline methods use deep learning models as classifiers, and some build complex neural networks from several deep learning models, FEDCC still greatly improves the DGA detection results. We analyze the reasons as follows. WSDL, DNSML and DBD use simple deep learning models with few features and obtain the worst Accuracy, Precision, Recall and F1. Although CNN-BiLSTM, DGA-RNN and N-CBDC jointly use multiple deep learning models, the features they extract are still limited, so their Accuracy, Precision, Recall and F1 are not ideal. HDNN, ATT-CNN-BiLSTM, LA-BM07 and HAGD use more complex deep learning models but still insufficiently rich features; although their Accuracy, Precision, Recall and F1 exceed 0.9, they remain lower than those of FEDCC. In addition to the domain name features, FEDCC obtains not only the Whois features but also the N-gram features by constructing the DNWSFS and DNBSFS. Through rich feature extraction and a carefully designed deep learning model, FEDCC greatly improves the Accuracy, Precision, Recall and F1. Furthermore, constructing the Domain Center reduces the time complexity from O(l × n) to O(c + 1), which greatly reduces the classification time.

6.6 Model analysis

6.6.1 Receiver Operating Characteristic (ROC) curves.

The DGA detection dataset constructed in Section 6.1 is a balanced binary classification dataset. Therefore, to better interpret the DGA detection results of FEDCC and the baseline methods, the ROC curves of the detection results of each comparison model are drawn in Fig 9, and the Area Under the Curve (AUC) values of each model are calculated. It is easy to see from the ROC curves and AUC values that FEDCC is significantly better than all baseline methods.

Fig 9. ROC curves of DGA detection results of comparison models.

https://doi.org/10.1371/journal.pone.0279866.g009

6.6.2 Classification results of different DGA families.

In Section 6.1, the domain names of 58 DGA families are uniformly assigned to the DGA domain dataset, and the domain name naming methods of different DGA families may differ considerably. For example, the domain names of the Banjori, Chinad and Conficker families are constructed from random characters, while those of the Bigviktor and Matsnu families are constructed from dictionaries. To further verify the classification effect of FEDCC on different DGA families, the following experiments are conducted. 1,000 benign domain names and the domain names of each of the 58 DGA families in Section 6.1 are selected to form new domain name datasets. FEDCC and the baseline methods are applied to these datasets to obtain the detection results for the 58 DGA families, and the classification Accuracy, Precision, Recall and F1 are shown in Figs 10–13, respectively. From Figs 10–13, it can be seen that the classification results of FEDCC for most DGA families are far better than those of the baseline methods. Therefore, FEDCC can not only accurately detect DGA domains, but also accurately judge which DGA family a DGA domain belongs to.

Fig 10. The classification Accuracy of the comparison models for 58 DGA families.

https://doi.org/10.1371/journal.pone.0279866.g010

Fig 11. The classification Precision of the comparison models for 58 DGA families.

https://doi.org/10.1371/journal.pone.0279866.g011

Fig 12. The classification Recall of the comparison models for 58 DGA families.

https://doi.org/10.1371/journal.pone.0279866.g012

Fig 13. The classification F1 of the comparison models for 58 DGA families.

https://doi.org/10.1371/journal.pone.0279866.g013

6.6.3 Importance analysis of each component.

A feature set with rich features, including the domain name features, the Whois features and the N-gram features, is constructed in this paper. In addition, a deep learning model based on BiLSTM, Attention, skip connect and CNN is built, and the Domain Center is built for fast DGA detection. The previous experiments show that the constructed model greatly promotes the DGA detection results. The following experiments verify the influence of each component. The relative improvement ratio Δ, the classification Accuracy and the classification time are used as the evaluation metrics:

Δ = (ACC_FEDCC − ACC) / ACC × 100%

where ACC_FEDCC represents the DGA detection Accuracy of FEDCC, and ACC represents the DGA detection Accuracy after changing each component. The results are shown in Table 6. It can be seen from Table 6 that when any component of FEDCC is removed, the detection Accuracy is reduced, and removing a feature set reduces the Accuracy more than removing the Domain Center or a component of the deep learning model. Analyzing the three feature sets shows that the Whois features have the greatest impact on the model, followed by the N-gram features and the domain name features. Analyzing the four components of the deep learning model shows that BiLSTM has the greatest impact, followed by Attention, skip connect and CNN.

Table 6. DGA detection results of different variant models.

https://doi.org/10.1371/journal.pone.0279866.t006

Although the Domain Center is less important in classification accuracy than the feature set, when it is removed, classification time increases substantially. Therefore, the Domain Center can greatly reduce classification time while improving classification accuracy.

6.6.4 Importance analysis of different features.

Domain name features, Whois features and N-gram features are combined into a feature vector of length 35. The above experiments show that the construction of this feature vector plays a great role in promoting the DGA detection results. To further study the influence of different features on the detection results, the importance of each feature is analyzed through the following experiments. Following the official documentation of LightGBM, the importances of the 35 features are calculated by accumulating, over all splits on a feature, the change in prediction values produced by the split:

Δ_split = c_1 (v_1 − v̄)² + c_2 (v_2 − v̄)²,  v̄ = (c_1 v_1 + c_2 v_2) / (c_1 + c_2)

where c_1 and c_2 are the number of objects in each leaf, and v_1 and v_2 are the formula values in the left and right leaves, respectively. The importance results of the 35 features are shown in Fig 14. It can be seen from Fig 14 that most of the Whois features are more important than the domain name features and the N-gram features, which further verifies the experimental results in Section 6.6.3. Analyzing the reason, Whois contains entity information such as the domain name registrar and server, and extracting this information greatly improves the DGA detection results.
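A hedged sketch of obtaining gain-based feature importances with the LightGBM Python API follows; the random arrays stand in for the real 35-feature vectors and DGA/benign labels.

```python
import numpy as np
import lightgbm as lgb

# Hypothetical arrays: X holds 35-dimensional feature vectors, y the
# binary DGA/benign labels (random stand-ins, illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 35))
y = rng.integers(0, 2, size=1000)

model = lgb.LGBMClassifier(n_estimators=100).fit(X, y)
# Gain-based importance: summed split contributions per feature.
gain = model.booster_.feature_importance(importance_type="gain")
ranking = np.argsort(gain)[::-1]        # most important feature first
```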

6.6.5 Analysis of the Domain Center.

The Domain Center quickly detects whether a newly entered domain name is a DGA domain through the Euclidean distance [18]. Since there are many methods for calculating vector similarity, the following experiments explain why the Euclidean distance is used in this paper. Besides the Euclidean distance, the Manhattan distance [38], cosine similarity [39], Minkowski distance [40] and Chebyshev distance [41] are also tested, and the results of the different similarity methods are shown in Table 7. Although the Manhattan distance uses less time and the Chebyshev distance achieves the highest Precision, the Euclidean distance achieves the best overall results. Therefore, the Euclidean distance is selected for fast classification of DGA domains.

Table 7. DGA detection results of different similarity methods.

https://doi.org/10.1371/journal.pone.0279866.t007
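The five candidate measures of Table 7 are all available in scipy.spatial.distance; a small sketch for comparing a new hidden vector u against a stored mean vector v (random toy vectors here):

```python
import numpy as np
from scipy.spatial import distance

# u: a new domain's hidden vector; v: a stored Domain Center mean vector.
u, v = np.random.default_rng(4).normal(size=(2, 64))
measures = {
    "Euclidean": distance.euclidean(u, v),
    "Manhattan": distance.cityblock(u, v),
    "Cosine":    distance.cosine(u, v),        # 1 - cosine similarity
    "Minkowski": distance.minkowski(u, v, p=3),
    "Chebyshev": distance.chebyshev(u, v),
}
```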

The mean vectors of the hidden vectors of the datasets in Section 6.1 are stored in the Domain Center. Differences in the number of samples in the training set change the mean vectors stored in the Domain Center, which in turn affects the DGA detection results. The following experiments verify the relationship between the DGA detection results and the number of training samples, with the results shown in Table 8. It can be seen that as the number of samples in each category increases, the DGA detection results gradually improve. Once the number of samples in each category exceeds 1,000, the gains from additional samples become smaller and smaller, while the training time grows in proportion to the number of samples. Therefore, the number of samples for each category is set to 1,000 in this paper.

Table 8. DGA detection results of different number of samples.

https://doi.org/10.1371/journal.pone.0279866.t008

7 Conclusion

Due to the limited features that traditional DGA detection models can extract, their detection results are not ideal. To solve this problem, a rich feature set, including the domain name features, the Whois features and the N-gram features, and a deep learning model based on BiLSTM, Attention and CNN are constructed for DGA detection. In addition, the Domain Center is built to reduce DGA detection time. Multiple comparative experiments show that the proposed model not only achieves the best Accuracy, Precision, Recall and F1, but also greatly reduces the classification time of newly entered domain names.

However, the model built in this paper is based on passively acquired data and cannot actively detect DGA domains. To better realize the task of DGA detection and governance, our main task in the future is to develop a model that actively detects DGA domains, so as to proactively prevent the infringement of DGA domains.

References

  1. Sood Aditya K and Zeadally Sherali. A taxonomy of domain-generation algorithms. IEEE Security & Privacy. 2016;14(4):46–53.
  2. Antonakakis Manos and April Tim and Bailey Michael and Bernhard Matt and Bursztein Elie and Cochran Jaime et al. Understanding the mirai botnet. In: 26th USENIX Security Symposium (USENIX Security 17); 2017. p. 1093–1110.
  3. Zhao Dan and Li Hao and Sun Xiuwen and Tang Yazhe. DOLPHIN: Phonics based Detection of DGA Domain Names. Computer-Aided Design. 2021;01–06.
  4. Shahzad Haleh and Sattar Abdul Rahman and Skandaraniyam Janahan. DGA domain detection using deep learning. In: 2021 IEEE 5th International Conference on Cryptography, Security and Privacy (CSP); 2021. p. 139–143.
  5. Zeng Yuwei and Yun Xiaochun and Chen Xunxun and Li Boquan and Tsang Haiwei and Wang Yipeng et al. Finding disposable domain names: A linguistics-based stacking approach. Computer Networks. 2021;184:107642.
  6. Ravi Vinayakumar and Alazab Mamoun and Srinivasan Sriram and Arunachalam Ajay and Soman KP. Adversarial defense: DGA-based botnets and DNS homographs detection through integrated deep learning. IEEE Transactions on Engineering Management. 2021.
  7. Yu Bin and Pan Jie and Gray Daniel and Hu Jiaming and Choudhary Chhaya and Nascimento Anderson CA et al. Weakly supervised deep learning for the detection of domain generation algorithms. IEEE Access. 2019;7:51542–51556.
  8. Yang Luhui and Liu Guangjie and Dai Yuewei and Wang Jinwei and Zhai Jiangtao. Detecting stealthy domain generation algorithms using heterogeneous deep neural network framework. IEEE Access. 2020;8:82876–82889.
  9. Chen Yijing and Pang Bo and Shao Guolin and Wen Guozhu and Chen Xingshu. DGA-based botnet detection toward imbalanced multiclass learning. Tsinghua Science and Technology. 2021;26(4):387–402.
  10. Lu Chaoyi and Liu Baojun and Zhang Yiming and Li Zhou and Zhang Fenglu and Duan Haixin et al. From WHOIS to WHOWAS: A Large-Scale Measurement Study of Domain Registration Privacy under the GDPR. In: Proceedings of the 2021 Network and Distributed System Security Symposium (NDSS); 2021. p. 21.
  11. Zhao Hong and Chang Zhaobin and Bao Guangbin and Zeng Xiangyan. Malicious domain names detection algorithm based on N-gram. Journal of Computer Networks and Communications. 2019:1–8.
  12. Zaremba W, Sutskever I, Vinyals O. Recurrent neural network regularization; 2014. arXiv:1409.2329.
  13. Canziani Alfredo and Paszke Adam and Culurciello Eugenio. An analysis of deep neural network models for practical applications; 2016. arXiv:1605.07678.
  14. Wang Shouxiang and Wang Xuan and Wang Shaomin and Wang Dan. Bi-directional long short-term memory method based on attention mechanism and rolling update for short-term load forecasting. International Journal of Electrical Power & Energy Systems. 2019;109:470–479.
  15. Niu Zhaoyang and Zhong Guoqiang and Yu Hui. A review on the attention mechanism of deep learning. Neurocomputing. 2021;452:48–62.
  16. Albawi Saad and Mohammed Tareq Abed and Al-Zawi Saad. Understanding of a convolutional neural network. In: 2017 International Conference on Engineering and Technology (ICET); 2017. p. 1–6.
  17. Iandola Forrest N and Han Song and Moskewicz Matthew W and Ashraf Khalid and Dally William J and Keutzer Kurt. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size; 2016. arXiv:1602.07360.
  18. Remlinger Carl and Mikael Joseph and Elie Romuald. Conditional loss and deep Euler scheme for time series generation. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2022. p. 8098–8105.
  19. Beaman Craig and Barkworth Ashley and Akande Toluwalope David and Hakak Saqib and Khan Muhammad Khurram. Ransomware: Recent advances, analysis, challenges and future research directions. Computers & Security. 2021;111:102490. pmid:34602684
  20. Botacin Marcus and Ceschin Fabricio and Sun Ruimin and Oliveira Daniela and Grégio André. Challenges and pitfalls in malware research. Computers & Security. 2021;106:102287.
  21. Yang Donghui and Li Zhenyu and Jiang Haiyang and Tyson Gareth and Li Hongtao and Xie Gaogang et al. A deep dive into DNS behavior and query failures. Computer Networks. 2022;109131.
  22. Tuan Tong Anh and Anh Nguyen Viet and Long Hoang Viet. Assessment of machine learning models in detecting DGA botnet in characteristics by TF-IDF. In: 2021 IEEE International Conference on Machine Learning and Applied Network Technologies (ICMLANT); 2021. p. 1–5.
  23. Štampar Miroslav and Fertalj Krešimir. Applied machine learning in recognition of DGA domain names. Computer Science and Information Systems. 2022;19(1):205–227.
  24. Soleymani Ali and Arabgol Fatemeh. A novel approach for detecting DGA-based botnets in DNS queries using machine learning techniques. Journal of Computer Networks and Communications. 2021;2021:5–12.
  25. Chin Tommy and Xiong Kaiqi and Hu Chengbin and Li Yi. A machine learning framework for studying domain generation algorithm (DGA)-based malware. In: International Conference on Security and Privacy in Communication Systems; 2018. p. 433–448.
  26. Li Yi and Xiong Kaiqi and Chin Tommy and Hu Chengbin. A machine learning framework for domain generation algorithm-based malware detection. IEEE Access. 2019;7:32765–32782.
  27. Baruch Moran and David Gil. Domain generation algorithm detection using machine learning methods. In: Cyber Security: Power and Technology; 2018. p. 133–161.
  28. Tuan Tong Anh and Long Hoang Viet and Taniar David. On detecting and classifying DGA botnets and their families. Computers & Security. 2022;113:102549.
  29. Namgung Juhong and Son Siwoon and Moon Yang-Sae. Efficient deep learning models for DGA domain detection. Security and Communication Networks. 2021;2021:10.
  30. Liang Jianbing and Chen Shuhui and Wei Ziling and Zhao Shuang and Zhao Wei. HAGDetector: Heterogeneous DGA domain name detection model. Computers & Security. 2022;102803.
  31. Lison Pierre and Mavroeidis Vasileios. Automatic detection of malware-generated domains with recurrent neural models; 2017. arXiv:1709.07102.
  32. Xu Congyuan and Shen Jizhong and Du Xin. Detection method of domain names generated by DGAs based on semantic representation and deep neural network. Computers & Security. 2020;85:77–88.
  33. Ren Fangli and Jiang Zhengwei and Wang Xuren and Liu Jian. A DGA domain names detection modeling method based on integrating an attention mechanism and deep neural network. Cybersecurity. 2020;3(1):1–13.
  34. Hochreiter Sepp and Schmidhuber Jürgen. Long short-term memory. Neural Computation. 1997;9(8):1735–1780. pmid:9377276
  35. Ding Yadong and Wu Yu and Huang Chengyue and Tang Siliang and Wu Fei and Yang Yi et al. NAP: Neural architecture search with pruning. Neurocomputing. 2022;477:85–95.
  36. Mao Jian and Zhang Jiemin and Tang Zhi and Gu Zhiling. DNS anti-attack machine learning model for DGA detection. Physical Communication. 2020;40:101069.
  37. Vinayakumar R and Soman KP and Poornachandran Prabaharan and Alazab Mamoun and Jolfaei Alireza. DBD: Deep learning DGA-based botnet detection. In: Deep Learning Applications for Cyber Security; 2019. p. 127–149.
  38. Sun Yongjian and Li Shaohui and Wang Yaling and Wang Xiaohong. Fault diagnosis of rolling bearing based on empirical mode decomposition and improved Manhattan distance in symmetrized dot pattern image. Mechanical Systems and Signal Processing. 2021;159:107817.
  39. Xia Peipei and Zhang Li and Li Fanzhang. Learning similarity with cosine similarity ensemble. Information Sciences. 2015;307:39–52.
  40. Groenen Patrick JF and Jajuga Krzysztof. Fuzzy clustering with squared Minkowski distances. Fuzzy Sets and Systems. 2001;120(2):227–237.
  41. Klove Torleiv and Lin Te-Tsung and Tsai Shi-Chun and Tzeng Wen-Guey. Permutation arrays under the Chebyshev distance. IEEE Transactions on Information Theory. 2010;56(6):2611–2617.