Two-layer detection framework with a high accuracy and efficiency for a malware family over the TLS protocol

The transport layer security (TLS) protocol is widely adopted by apps as well as malware. With the geometric growth of TLS traffic, accurate and efficient detection of malicious TLS flows is becoming an imperative. However, current studies focus on either detection accuracy or detection efficiency, and few studies take into account both indicators. In this paper, we propose a two-layer detection framework composed of a filtering model (FM) and a malware family classification model (MFCM). In the first layer, a new set of TLS handshake features is presented to train the FM, which is devised to filter out a majority of benign TLS flows. For identifying malware families, both TLS handshake features and statistical features are applied to construct the MFCM in the second layer. Comprehensive experiments are conducted to substantiate the high accuracy and efficiency of the proposed two-layer framework. A total of 96.32% of benign TLS flows can be filtered out by the FM with few malicious TLS flows being discarded provided the threshold of the FM is set to 0.01. Moreover, a multiclassifier is selected to construct the MFCM to provide better performance than a set of binary classifiers under the same feature set. In addition, when the ratio of benign and malicious TLS flows is set to 10:1, the detection efficiency of the two-layer framework is 188% faster than that of the single-layer framework, while the average detection accuracy reaches 99.45%.


The TLS protocol can guarantee the security of users' access to the Internet; however, it also makes it easier for malware to establish command and control (C&C) channels. Malware can easily pass through the firewall via TLS-based communication technology, and the encrypted payload makes it difficult to analyze. Malicious TLS traffic has also shown an increasing trend in recent years. As portrayed in Cisco's report in 2018 [?], 33% of malware utilizes the TLS protocol to establish C&C communication. In addition, MITRE ATT&CK [?] has recorded a series of cyber attacks exposed in the past few years, and attacks using port 443 to establish C&C communication account for 66.67% of them. Therefore, the wide application of the TLS protocol makes it a large challenge to identify malicious TLS flows with suitable efficiency.

Facing this sophisticated and untrusted communication environment, this paper proposes a two-layer detection framework with high speed and high precision based on a supervised learning algorithm. Current studies focus on either improving the detection accuracy or improving the detection efficiency. Few studies discuss how to improve the detection efficiency for a two-layer detection framework without affecting the detection accuracy. Indeed, as long as a majority of benign TLS flows are excluded quickly, both detection indexes can be guaranteed.

3) This paper proposes a two-layer framework to refine the efficacy of detecting TLS flows, in which the first layer applies a binary classifier to filter out benign TLS flows and the second layer employs a multiclassifier to identify the malware family of TLS flows. Experiments show that our two-layer framework can greatly improve the detection efficiency, while the detection accuracy is also guaranteed.
[?] designed a two-layer detection model and focused on introducing a tree-based feature transformation algorithm to obtain more effective features. The main function of the first layer was also to filter out benign packets, but there was no detailed description of the filtering mechanism, and they did not evaluate whether the method they proposed could improve the detection efficiency.

[...] studies depicted above focus on how to refine the detection accuracy and seldom discuss the impact on detection efficiency.

Problem statement

To detect malicious TLS flows efficiently, this paper proposes a two-layer detection framework. The first layer is designed to filter out benign network traffic; the second layer is utilized to identify malware families of TLS flows. Similar detection frameworks are used in [?] and [?], but those works give neither a description of the filtering mechanism nor an efficiency evaluation. Moreover, the TLS flow is a kind of encrypted network traffic and cannot be filtered by simply matching a signature. For the proposed two-layer detection framework, in addition to the extra time consumed by the filtering model, the traversal times of the two-layer framework are also more than that of the single-layer framework,

In a real gigabit network environment, hundreds of TLS flows are generated every minute, which makes it costly to identify the malware families of TLS flows in real time.

When a new TLS flow is imported into this detection framework, the detection process is as follows. The flow first enters the filtering model; if this model identifies it as a benign TLS flow, it is directly discarded and no longer put into the next layer; if it is classified as a potentially malicious TLS flow, it passes to the next layer, which identifies the malware family it belongs to. Through this process, one can see that the TLS flows that are not discarded by the filtering model may contain both malicious TLS flows and benign TLS flows. However, the number of benign TLS flows in the second layer is much less than the number in the first layer; thus, the detection efficiency can be improved.
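The detection process above can be sketched as a short routine. This is a minimal illustration, not the authors' implementation: the two model callables, the function name `detect`, and the rule "discard when the estimated malicious probability is below the threshold" are assumptions made for the example. The default threshold of 0.01 is the value quoted in the abstract.

```python
from typing import Callable, Optional, Sequence

# Hypothetical model interfaces (assumptions, not taken from the paper):
#   filtering_model(flow) -> estimated probability that the flow is malicious
#   family_model(flow)    -> predicted malware family label
FilteringModel = Callable[[Sequence[float]], float]
FamilyModel = Callable[[Sequence[float]], str]


def detect(flow: Sequence[float],
           filtering_model: FilteringModel,
           family_model: FamilyModel,
           threshold: float = 0.01) -> Optional[str]:
    """Return None if the first layer discards the flow, else the family label."""
    # Layer 1: discard the flow when its malicious probability is below the threshold.
    if filtering_model(flow) < threshold:
        return None
    # Layer 2: only flows that survive the filter reach the costlier classifier.
    return family_model(flow)
```

Because the second-layer model is only called for flows that survive the filter, the saving grows with the share of benign flows that the first layer removes.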

To make the two-layer framework more efficient, the time consumed by the first layer must be less than that of the second layer; otherwise, the two-layer framework would have the opposite effect. In this section, an inequality is used to infer the condition for superior efficiency by calculating the time consumed by each of the two models. If our method is more efficient, its time overhead is lower. Consider the following inequality.
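As an illustration of how such a condition can be obtained, the following derivation uses a simplified cost model. The symbols are assumptions of this sketch and are not the paper's notation: N is the number of incoming flows, t_1 and t_2 are the average per-flow times of the first-layer and second-layer models, and r is the fraction of flows discarded by the first layer. The single-layer framework is assumed to run the second-layer model on every flow.

```latex
\begin{align*}
T_{\text{two}}    &= N t_1 + (1 - r)\, N t_2, \\
T_{\text{single}} &= N t_2, \\
T_{\text{two}} < T_{\text{single}}
  &\iff t_1 < r\, t_2
   \iff r > \frac{t_1}{t_2} \qquad (t_2 > 0).
\end{align*}
```

Since r cannot exceed 1, the condition can only hold when t_1 < t_2, which agrees with the requirement that the first layer be faster than the second.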

First, there are many TLS flows without the entire TLS handshake process because of some optimization schemes, such as session tickets. However, when the connection to the server occurs for the first time or when the session ticket expires, the entire TLS handshake process is required to connect to the server.

Since we obtain 705 TLS handshake features, the feature dimension needs to be reduced before training the model. Based on the information gain algorithm mentioned in the previous section, we can calculate the information gain value (IGV) for each feature and select candidate feature sets based on the IGV. The detailed process is presented in Algorithm 1. The modified wrapper method with a backward selection strategy is used to select the best feature subset. The information gain of each feature should be calculated in advance. IG(F_i) represents the result of information gain for feature subset F_i, F_i represents the ith feature subset, and F_0 represents the original feature set. The accuracy (ACC) and false positive rate (FPR) can be calculated by the classifier. X_labeled represents the labeled benign samples and malicious samples.
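For reference, the information gain of a single feature can be computed as in the sketch below. It uses the standard entropy-based definition and assumes that the feature is discrete (categorical or already binned). The function names are illustrative and are not taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """IG = H(labels) - sum over v of P(v) * H(labels | feature == v)."""
    total = len(labels)
    # Group the class labels by the value the feature takes on each sample.
    groups: Dict[Hashable, List[Hashable]] = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature that takes the same value on every sample yields a gain of 0, which is why such features can be ranked last.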

[...]
9: if ACC_{F_i} < ACC_{F_{i-1}} and FPR_{F_i} > FPR_{F_{i-1}} then
10:   if BFS is NULL then
[...]
12:   end if
[...]
17: end for
18: return BFS

There are three main steps in Algorithm 1: 1) preparatory work (1-2); 2) calculating the initial parameters based on the classifier (3); 3) evaluating F_i and selecting the best feature subset (4-17). In step 3, the backward selection strategy is used to construct a feature subset (F_i), and the number of features in F_i is 1 less than that in F_{i-1}. The features whose IGV is 0 can be directly [...] in which the minimum IGV is equal to or greater than 0 to the feature subset in which the minimum IGV is equal to or greater than 0.004.

[...] number ranging from 2,000 to 20,000. The ratio of positive and negative samples is 1:1. When the sample size increases past 10,000, both the accuracy and false [...]

Therefore, we select the random forest algorithm to train our filtering model.
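A backward-selection wrapper guided by the IGV can be sketched as follows. This is not a reconstruction of Algorithm 1. It borrows only the stopping test visible above (accuracy falls and the false positive rate rises relative to the previous subset). It also assumes that zero-IGV features are dropped up front and that the lowest-IGV feature is removed at each step. The `evaluate` callback is hypothetical and stands in for training and scoring a classifier on the labeled samples.

```python
from typing import Callable, Dict, List, Tuple

# evaluate(features) -> (accuracy, false_positive_rate); hypothetical callback.
Evaluate = Callable[[List[str]], Tuple[float, float]]


def backward_select(igv: Dict[str, float], evaluate: Evaluate) -> List[str]:
    """Drop the lowest-IGV feature one at a time until performance degrades."""
    # Assumption: features with zero information gain are removed before the search.
    current = sorted((f for f, g in igv.items() if g > 0),
                     key=lambda f: igv[f], reverse=True)
    if not current:
        return []
    acc_prev, fpr_prev = evaluate(current)
    while len(current) > 1:
        candidate = current[:-1]  # one feature fewer than the previous subset
        acc, fpr = evaluate(candidate)
        # Stop when accuracy falls and the false positive rate rises together.
        if acc < acc_prev and fpr > fpr_prev:
            break
        current, acc_prev, fpr_prev = candidate, acc, fpr
    return current
```

Removing features in ascending IGV order keeps the search linear in the number of features, at the cost of never revisiting a feature once it has been dropped.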

The contribution of features can also be evaluated by the classifier based on the random forest algorithm. The 20 most important features are shown in Table 4. The cipher suites occupy nearly half, which means that the client cipher suites used by benign applications and malware are remarkably different, since malware tends to utilize simpler algorithms to encrypt network traffic. There are 7 features that we propose in this paper, marked with a new tag, which demonstrates the effectiveness of the proposed features.

[...] forest algorithm to calculate the confusion matrix for each threshold.
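Computing a confusion matrix for each candidate threshold can be sketched as below. The sketch assumes that the filtering model outputs an estimated malicious probability per flow and that flows scoring at or above the threshold are passed on as potentially malicious. Both the function names and that decision rule are assumptions made for the example.

```python
from typing import Dict, Iterable, Sequence


def confusion_at(threshold: float,
                 scores: Sequence[float],
                 labels: Sequence[int]) -> Dict[str, int]:
    """Confusion matrix when flows scoring >= threshold are flagged as malicious.

    scores: estimated malicious probability per flow (hypothetical model output).
    labels: 1 for malicious, 0 for benign.
    """
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if flagged and y == 1:
            cm["tp"] += 1
        elif flagged:
            cm["fp"] += 1  # benign flow passed on to the second layer
        elif y == 1:
            cm["fn"] += 1  # malicious flow wrongly discarded by the filter
        else:
            cm["tn"] += 1  # benign flow correctly filtered out
    return cm


def sweep(thresholds: Iterable[float],
          scores: Sequence[float],
          labels: Sequence[int]) -> Dict[float, Dict[str, int]]:
    """Confusion matrix for every candidate threshold."""
    return {t: confusion_at(t, scores, labels) for t in thresholds}
```

Lowering the threshold reduces the number of malicious flows that are discarded (`fn`) but lets more benign flows through to the second layer (`fp`), which is the trade-off that the choice of threshold has to balance.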

Before training the models, it is necessary to select relevant features from the original 705 TLS handshake features and 664 statistical features. Nevertheless, the information gain algorithm cannot be directly used to select features for a multiclass sample set. The feature selection method we used here contains two steps: 1) selecting relevant features for each binary classifier by utilizing the information gain algorithm and 2) utilizing the union of the 10 feature sets selected from the 10 binary classifiers as our feature set. The process of feature selection for each binary classifier is the same as that in the filtering model. As shown in Fig. 7, the ACC and FPR of each binary classifier are calculated among different feature sets.
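The two steps can be illustrated with a small sketch. It assumes that each binary classifier separates one class from all the others (a one-vs-rest relabeling, which the text does not state explicitly) and that the per-classifier feature subsets are already available as sets of feature names. All identifiers are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Set


def one_vs_rest(labels: Sequence[Hashable], family: Hashable) -> List[int]:
    """Relabel a multiclass sample set as binary: 1 for `family`, 0 otherwise."""
    return [1 if y == family else 0 for y in labels]


def union_feature_set(selected: Dict[Hashable, Set[str]]) -> List[str]:
    """Merge the feature subsets chosen for each binary classifier."""
    merged: Set[str] = set()
    for features in selected.values():
        merged |= features
    return sorted(merged)
```

The binary relabeling lets the information gain algorithm be applied per class, and the union keeps every feature that was useful for at least one class.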

After completing the two steps mentioned above, we finally obtain 762 features, including 234 TLS handshake features and 528 statistical features, and use the random forest algorithm to train our binary classifiers (BCs) and our multiclassifier (MC). We also use 10-fold cross-validation to evaluate the performance of these two options. The performance indexes of the two options are shown in Table 6. The overall performance of the MC is slightly better than that of the BCs, and their average accuracies are 98.41% and 98.36%, respectively.

[...] features occupy a majority compared to statistical features (that is, 16:4), so we can conclude that the TLS handshake features are more important than statistical features.

We also compare the time consumption at different ratios of benign and malicious samples. At each ratio, we test a total of 10 times and calculate the average time consumption. As shown in Fig. 9, when the ratio is 1:1, the single-layer framework is not much different from the two-layer framework. However, along with the increase in the number of benign samples, the two-layer framework is increasingly advantageous. When the ratio reaches 10:1, the two-layer framework is nearly twice as fast as the single-layer framework. In the real network environment, since benign TLS flows account for the vast majority (the ratio is far more than 10:1), application of the two-layer detection framework is well justified.

In summary, we demonstrate that the two-layer detection framework needs to meet certain conditions to improve the detection efficiency of TLS flows. That is, 1) the detection efficiency of the coarse classification model in the first layer must be higher than that of the detection models in the second layer; 2) the ratio of flows filtered by the first layer must satisfy Ineq. (2). Otherwise, the improvement in detection efficiency cannot be guaranteed.

We also compare our method with other methods in terms [...] single-layer detection framework, and their method is more efficient than ours when the sample ratio is not more than 2:1. However, their method is of low efficiency when the sample ratio is over 2:1. The reason could be that the number of features they used is less than that in the second layer of our method but more than that in the first layer of our method. Comar's method is based on a two-layer detection framework, in which the first layer is also used to exclude benign flows. However, the second layer consists of a set of 1-class SVM models to identify a specific malware class, which means that a potentially malicious flow needs to traverse all the models before obtaining the classification result. Though the number of features is less than in our method, the time consumption is always higher than ours. Chen's method proposed a triple-layer detection framework; the additional layer is the second layer, which is used to recognize the attack type. That is, a potentially malicious flow needs to be classified twice, which adds extra time for detection. Thus, the efficiency of Chen's method is always less than ours.
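The dependence of the speed-up on the benign-to-malicious ratio can be explored with a simple cost model. The sketch below is illustrative only: the per-flow costs `t1` and `t2` are hypothetical parameters, and the model assumes that no malicious flow is discarded by the filter and that the single-layer framework runs the second-layer model on every flow. It does not reproduce the measurements in Fig. 9.

```python
def single_layer_time(n_benign: int, n_malicious: int, t2: float) -> float:
    """Every flow is classified by the second-layer model."""
    return (n_benign + n_malicious) * t2


def two_layer_time(n_benign: int, n_malicious: int,
                   t1: float, t2: float,
                   benign_filter_rate: float) -> float:
    """Every flow passes the filter; only surviving flows reach the second layer.

    Simplification: no malicious flow is discarded by the filter.
    """
    total = n_benign + n_malicious
    survivors = n_malicious + (1.0 - benign_filter_rate) * n_benign
    return total * t1 + survivors * t2


def speedup(ratio: int, t1: float, t2: float, benign_filter_rate: float,
            n_malicious: int = 1000) -> float:
    """Single-layer time divided by two-layer time at a benign:malicious ratio."""
    n_benign = ratio * n_malicious
    return (single_layer_time(n_benign, n_malicious, t2)
            / two_layer_time(n_benign, n_malicious, t1, t2, benign_filter_rate))
```

For example, `speedup(10, t1=0.2, t2=1.0, benign_filter_rate=0.9632)` can be compared with `speedup(1, ...)` to see the advantage grow with the share of benign flows; 0.9632 is the filtering rate reported in the abstract, whereas the costs 0.2 and 1.0 are made up for the example. Setting `t1` equal to `t2` pushes the value below 1, matching condition 1) above.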

The TLS protocol, as a kind of cryptographic protocol, is increasingly employed by malware to establish C&C channels. The identification of malicious TLS flows is