Abstract
Binary code similarity detection plays a crucial role in various applications within binary security, including vulnerability detection, malicious software analysis, etc. However, existing methods suffer from limited differentiation in binary embedding representations across different compilation environments, lacking dynamic high-level semantics. Moreover, current approaches often neglect multi-level semantic feature extraction, thereby failing to acquire precise semantic information about the binary code. To address these limitations, this paper introduces a novel detection solution called BinBcla. This method employs an enhanced pre-training model to generate instruction embeddings with dynamic semantics for binary functions. Subsequently, a multi-feature fusion technique is utilized to extract local semantic information and long-distance global features from the code, with self-attention employed to comprehend the structural information of the code. Finally, an improved cosine similarity method is employed to learn relationships among all elements of the distance vectors, thereby enhancing the model’s robustness to new sample functions. Experiments are conducted across different architectures, compilers, and optimization levels. The results indicate that BinBcla achieves higher accuracy, precision and F1 score compared to existing methods.
Citation: Jia Y, Yu Z, Hong Z (2024) Semantic aware-based instruction embedding for binary code similarity detection. PLoS ONE 19(6): e0305299. https://doi.org/10.1371/journal.pone.0305299
Editor: Asadullah Shaikh, Najran University College of Computer Science and Information Systems, SAUDI ARABIA
Received: March 2, 2024; Accepted: May 27, 2024; Published: June 11, 2024
Copyright: © 2024 Jia et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All datasets files are available from the figshare database (DOI:https://doi.org/10.6084/m9.figshare.25778349.v1).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Owing to the non-open source nature of the source code of commercial software, malicious code, and legacy programs, conducting binary code similarity analysis on these software applications holds significant importance in various security applications. These applications include vulnerability search, malware detection, code clone detection, etc. The objective of binary code similarity analysis is to identify structurally and semantically similar code fragments within a target binary code repository based on known binary code snippets. However, binary code is typically represented as assembly code functions obtained through disassembling executable files. Unlike functions in high-level languages, understanding functions in binary code is often challenging. In real-world scenarios, variations in architectures, compilers, and optimization levels can result in substantial changes to binary code. Since binary code lacks a vocabulary of natural language semantics, as found in source code, extracting semantic information presents a challenging task. Before the advent of machine learning in this field, traditional methods relied heavily on control flow graph (CFG) of binary code. However, these methods, while demonstrating a certain level of effectiveness, fail to capture essential features of binary code and are characterized by complex feature extraction processes, high computational overhead, and poor scalability. Moreover, current approaches represent functions as CFG without conducting multi-level feature extraction, thereby failing to acquire precise semantic information about the binary code and resulting in lower accuracy. Binary code embedding represents an emerging approach in similarity analysis, employing neural network models to transform binary code into vector representations. 
This method not only captures the semantic information of binary code, but also facilitates quantitative analysis of the similarity between corresponding binary codes by computing distances between vectors. Effective code representation helps mitigate the impact of significant assembly code differences resulting from diverse compilation environments. Recent advancements in binary code similarity detection have incorporated pre-trained natural language processing (NLP) models. SAFE [1] utilized Word2Vec to represent binary functions as sequences of assembly instructions, capturing both syntax and semantic information of instructions through BiRNN. Asm2vec [2] generated embeddings for functions based on the paragraph-vector distributed-memory (PV-DM) model. PalmTree [3] introduced the first BERT-based instruction embedding model. UniASM [4] employed unsupervised learning and a pre-trained BERT-based model to generate instruction embeddings. Despite the diverse algorithms employed by these deep learning-based methods, they all adhere to the concept of code embedding.
To further enhance code representation and improve detection performance, we employ an enhanced BERT-based pre-training model. This model’s pre-training tasks include masked language model (MLM) and similar function prediction (SFP), specifically tailored for constructing a universal model for binary code representation. The instruction embedding module models assembly functions by treating assembly instructions as tokens, converting each instruction into an embedding vector, and considering binary functions as sentences composed of sequences of instruction vectors. Subsequently, the embedding vectors of instructions are fed into the local semantic module to extract distinct local semantic features. A bidirectional LSTM is utilized to extract features capturing long-distance dependencies, modeling sequential interaction relationships within the vector sequences. When aggregating into functions, self-attention is applied to learn dependencies between instructions, allowing the final function embedding to consider the hidden states of all instructions. Finally, improved cosine similarity and cross-entropy loss are employed to learn distance vectors, capturing relationships among all elements of the distance vectors. This fundamentally enhances the model’s robustness to new sample functions. We conduct binary code similarity detection experiments across different architectures, different compilers, and different optimization levels. The experimental results demonstrate that our novel detection solution, called BinBcla, outperforms previous approaches in terms of accuracy, precision and F1 score.
In summary, the proposed method makes the following contributions:
- We develop an instruction embedding module, based on an enhanced BERT architecture, to provide dynamic semantic embeddings for binary code instructions. It dynamically adjusts token embedding based on varying contextual information, addressing the challenge of handling polysemy that Word2Vec struggles with.
- We propose a multi-feature fusion technique that performs feature extraction at different levels on binary code, greatly enriching the semantic feature information of binary code.
- We propose an improved cosine similarity method that addresses the issue of traditional cosine similarity being unable to distinguish differences between data objects with proportionally changing values in various dimensions, thereby enhancing the performance of the model.
- We conducted experiments on three tasks, and the results indicate that BinBcla achieves better performance compared to existing methods.
The remainder of this paper is organized as follows: the Related work section discusses related work in this field. The Methodology section introduces our methodology, including instruction embedding and multi-feature fusion. The Experimental setup section introduces implementation details. The Evaluation section introduces the results of the experiments. In the Conclusion section, we summarize our method and outline directions for future research.
2. Related work
2.1 Traditional approaches
Before the application of machine learning, traditional techniques mainly involved dynamic analysis and static analysis. Dynamic analysis methods rely on the premise that binary code sharing similar behavior at runtime exhibits logical similarity. These methods assess binary code similarity by analyzing manually crafted dynamic features. Genius [5], based on the graph isomorphism of CG/CFG, employs graph matching algorithms to validate the equality of two basic blocks. Pewny [6] represents each vertex of the CFG using expression trees, calculating the similarity between vertices based on the edit distance between corresponding expression trees. ESH [7] employs theorem proving to determine whether two basic blocks of a function are equivalent. BinSim [8] performs dynamic slicing through system calls and then employs symbolic execution to assess the equivalence of binary code. BinGo [9] and Blex [10] initialize function context through random sampling values, collecting I/O values to determine the similarity of functions. The drawback of these methods is their high computational cost and execution time, rendering them unsuitable for large-scale detection. Static analysis methods, based on differences in the structure of binary code, typically involve converting binary code into graphs and subsequently comparing the similarity of these graphs. Binslayer [11] employs the Hungarian algorithm to enhance graph matching, thereby improving the results for binary functions. BinSign [12] and Kam1n0 [13] use instructions or categorized operands as static features to compute the similarity of binaries. BinSequence [14] compares the similarity of two functions by calculating the edit distance between their instructions. XMATCH [15] leverages the graphs and tree edit distance of expression trees for basic blocks to compute the similarity of binary functions. DiscovRE [16] optimizes CFG-based matching by pre-filtering and eliminating unnecessary matching pairs.
TEDEM [17] introduces tree edit distance to detect basic block-level code similarity. These methods rely solely on the structure or syntactic features of binary code, neglecting semantic information between instructions, resulting in relatively low detection accuracy.
2.2 Learning-based approaches
In recent years, some research studies have applied deep learning. A common approach in these studies involves representing binary functions as numerical vectors through instruction embedding models, and then approximating the similarity between different binary functions using vector similarity [18]. These methods utilize Siamese neural networks to ensure that vectors of logically similar binary functions are closer in distance. Xu [19] proposed Gemini, which relies on graph embedding techniques to generate binary function embeddings through attributed CFG (ACFG) and Structure2vec. INNEREYE [20] and RLZ2019 [21] treat instructions as words and basic blocks as sentences, using Word2Vec to generate function embeddings and LSTM to learn block embeddings. Li [22] introduced a simple BiRNN-based Ins2vec embedding method. Ben [23] proposed a detection method based on Transformer and convolutional neural network (CNN). αDiff [24] utilizes TextCNN [25] to directly learn embeddings for binary functions from the raw bytes of the functions. DEEPBINDIFF [26] and Codee [27] employ neural networks to learn vector embeddings for instruction sequences. In the field of unsupervised learning, a notable solution for binary code similarity detection is Asm2Vec [2], which generates instruction sequences from CFG and employs unsupervised learning algorithms to generate embeddings for binary functions. Additionally, Massarelli [28] and Yu [29] investigated methods to learn embeddings for basic blocks of binary functions and utilized GNN to learn embeddings for ACFG of binary functions. Methods based on deep learning have exhibited high accuracy and computational efficiency. Leveraging these advantages, such approaches are well-suited for large-scale detection.
3. Methodology
3.1 Overview
To address the abovementioned challenges, we utilize an enhanced BERT-based instruction embedding model, in which assembly instructions are first transformed into embedding vectors. Subsequently, these vectors are inputted into a Siamese neural network with multi-feature fusion as subnetworks. A self-attention layer is incorporated into the Siamese neural network to obtain weight information corresponding to the instruction vectors. Based on the weights derived from the self-attention, the instruction embeddings are aggregated into a function embedding. Ultimately, a similarity score between the two functions is calculated using improved cosine similarity. The overall structure is illustrated in Fig 1.
3.2 Instruction embedding module
The feature extractor of BERT employs the Transformer, preserving richer semantic information in word vector representations, and has demonstrated excellent performance across various NLP tasks. The instruction embedding module utilizes the encoder of the Transformer structure to build a neural network model consisting of a multilayer bidirectional encoder. Designed as a universal model for constructing binary code representations, the instruction embedding module produces a vector embedding for each instruction within a function; its pre-training tasks include MLM and SFP.
3.2.1 Masked language model.
Assume that ti represents a token, and that the function instruction sequence I = t1, t2, t3, ⋯, tn is composed of a sequence of tokens, as illustrated in Fig 2. For instruction sequence I, the following process is applied. First, 15% of the tokens are randomly selected for masking. Among the selected tokens, 80% are replaced with [MASK], 10% are replaced with another randomly chosen token from the vocabulary, and the remaining 10% are left unchanged. Subsequently, the Transformer is employed to learn predictions and output the probability of predicting a specific token ti = [MASK] through Softmax:
$$p(\hat{t}_i \mid I) = \frac{\exp\left(w_i^{\top}\,\Theta(I)_i\right)}{\sum_{k=1}^{K} \exp\left(w_k^{\top}\,\Theta(I)_i\right)} \tag{1}$$

where $\hat{t}_i$ represents the prediction for ti, Θ(I)i denotes the i-th vector of the final layer of the Transformer Θ when the function instruction sequence I is used as the input, wi is the weight for ti, and K is the number of possible labels for ti.
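The 80/10/10 masking procedure above can be sketched in a few lines of Python; the toy assembly vocabulary and the helper name `mask_tokens` are illustrative only, not part of BinBcla.

```python
import random

MASK = "[MASK]"
# Toy vocabulary for the random-replacement branch (illustrative only).
VOCAB = ["mov", "add", "push", "pop", "ret", "jmp", "xor", "call"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: select 15% of positions; of those,
    80% -> [MASK], 10% -> random vocabulary token, 10% -> unchanged.
    Returns the masked sequence and {position: original token} targets."""
    rng = random.Random(seed)
    tokens = list(tokens)
    n_select = max(1, int(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_select)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]      # original token is the prediction target
        r = rng.random()
        if r < 0.8:
            tokens[pos] = MASK
        elif r < 0.9:
            tokens[pos] = rng.choice(VOCAB)
        # else: leave the token unchanged
    return tokens, targets
```

The model is then trained to predict each stored target token from the corrupted sequence via the Softmax of Eq (1).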
3.2.2 Similar function prediction.
SFP processes each batch of function pairs during data processing, rather than a single function pair. As depicted in Fig 3, each sample of each batch represents a pair of similar functions, such as [CLS] F [SEP] F′ [SEP], where F and F′ denote similar functions. Swapping the two similar functions constructs [CLS] F′ [SEP] F [SEP], which is then appended to the original samples. Consequently, each batch contains an even number of function pairs.
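The batch construction described above — appending the swapped sample [CLS] F′ [SEP] F [SEP] for every similar pair (F, F′) — can be sketched as follows; `build_sfp_batch` is a hypothetical helper, not a name from the paper.

```python
def build_sfp_batch(pairs):
    """Each similar pair (F, F') also contributes the swapped sample
    (F', F), so every batch holds an even number of function pairs."""
    batch = []
    for f, f_prime in pairs:
        batch.append(("[CLS]", f, "[SEP]", f_prime, "[SEP]"))
        batch.append(("[CLS]", f_prime, "[SEP]", f, "[SEP]"))
    return batch
```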
The embedding for the k-th function in each batch is denoted as vk = [v1, v2, ⋯, vd], where d represents the size of the hidden layer. Subsequently, L2 normalization is applied to each element in the embedding:
$$\hat{v}_i = \frac{v_i}{\sqrt{\sum_{j=1}^{d} v_j^2}} \tag{2}$$
The normalized function embedding vector, denoted as $\hat{v}_k$, is obtained. Batch normalization is then utilized to construct the embedding matrix $E = [\hat{v}_1, \hat{v}_2, \cdots, \hat{v}_b]^{\top} \in \mathbb{R}^{b \times d}$, where b is the size of the batch.

To calculate the similarity between two functions, we take the dot product between the embedding matrix $E$ and its transpose $E^{\top}$:

$$S = E E^{\top} \tag{3}$$
The matrix S, referred to as the similarity matrix, is defined such that each value represents the similarity between two functions. Smaller angles correspond to more similar vectors. To mitigate the impact of diagonal elements in the similarity matrix, all diagonal elements are set to negative infinity in this paper:
$$S' = S - \Lambda[+\infty] \tag{4}$$
where Λ[+∞] represents the diagonal matrix whose diagonal entries are +∞. Each row of the resulting similarity matrix is processed through a Softmax layer:
$$s_{ij} = \frac{\exp(S_{ij})}{\sum_{k=1}^{b} \exp(S_{ik})} \tag{5}$$
where sij represents the similarity between the i-th and j-th functions.
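Eqs (2)–(5) amount to normalising the embeddings, taking pairwise dot products, suppressing the diagonal, and applying a row-wise Softmax. A minimal plain-Python sketch (illustrative only — the real model operates on learned tensors):

```python
import math

def sfp_similarity(embeddings):
    """Given b function embeddings, return the b x b row-stochastic
    similarity matrix of Eqs (2)-(5)."""
    # Eq (2): L2-normalise each function embedding.
    normed = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v))
        normed.append([x / norm for x in v])
    b = len(normed)
    # Eq (3): pairwise dot products give the similarity matrix.
    S = [[sum(x * y for x, y in zip(normed[i], normed[j])) for j in range(b)]
         for i in range(b)]
    # Eq (4): suppress self-similarity on the diagonal.
    for i in range(b):
        S[i][i] = float("-inf")
    # Eq (5): row-wise Softmax (exp(-inf) evaluates to 0).
    P = []
    for row in S:
        exps = [math.exp(x) for x in row]
        total = sum(exps)
        P.append([e / total for e in exps])
    return P
```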
3.3 Multi-feature fusion
3.3.1 Local semantic module.
The reordering of statements and functions is a common technique in code plagiarism, and the order of functions can affect feature extraction. If NLP models are used to extract features from binary functions, the results may not be optimal. Convolutional kernels, on the other hand, can extract local contextual features from sliding windows that are not affected by the position of occurrence, which aligns with the requirements of binary code similarity detection. Convolutional Neural Networks (CNNs) have found widespread application across diverse domains within the field of deep learning [30–37]. In text processing, TextCNN uses convolutional kernels of different sizes to extract n-gram features at different positions, obtaining semantic features at various levels. However, its drawback is that it can only capture local relationships in the text, ignoring the impact of long-distance semantics. To address this issue, we introduce the local semantic module, which is based on the understanding that semantic comprehension proceeds sequentially from front to back. Therefore, the information preceding the convolution operation on the instruction embeddings is crucial. In the local semantic module, the performance of the model is enhanced by continuously incorporating previous information during the convolution operation; the process is illustrated in Fig 4.
First, embedding vectors are obtained using instruction embedding module, and the previous semantic matrix R = {r0, r1, ⋯, rn} is generated based on the embedding matrix S = {s1, s2, ⋯, sn}, as expressed by Eq (6):
(6)
where r0 is a zero vector. Subsequently, dimensionality is reduced through a fully connected layer to obtain the previous information vector G = {g0, g1, ⋯, gn}. Convolution operations are then performed using convolution kernels W of sizes 3, 4, and 5. In each convolution operation, features ci are obtained, as expressed by Eq (7):
$$c_i = f\left(W_h \cdot S_{i:i+h-1} + b_h\right) \tag{7}$$
where h × d represents the size of the convolution kernel Wh, d is the dimension of the token embedding, f is a nonlinear activation function, bh is a bias term, and Si:i+h−1 is the local code matrix from the i-th to the (i + h − 1)-th row of S. Finally, the local features are fused with the previous information features to obtain the convolution result ui. Max pooling is applied to U = {u1, u2, ⋯, un−h+1} using the following formulas:
(8)
(9)
As bidirectional LSTM takes input in a sequential structure and pooling disrupts the sequential structure of U, a fully connected layer is added to concatenate Mi from the pooling layer into a vector Q:
(10)
Subsequently, the new concatenated high-order vector Q serves as the input for bidirectional LSTM.
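A minimal sketch of the convolution-and-pooling path described above, using toy averaging kernels in place of learned weights; the previous-information fusion of Eq (6) is omitted here since its exact form is not reproduced, and the helper names are illustrative.

```python
def conv_features(S, h, W, b):
    """Eq (7)-style window convolution: slide a window of h rows over the
    embedding matrix S (n x d), apply weights W (h x d), a bias b, and ReLU."""
    n, d = len(S), len(S[0])
    feats = []
    for i in range(n - h + 1):
        window = S[i:i + h]
        z = sum(W[r][c] * window[r][c] for r in range(h) for c in range(d))
        feats.append(max(0.0, z + b))   # f = ReLU
    return feats

def local_semantic(S, kernel_sizes=(3, 4, 5)):
    """One convolution per window size, max pooling over each feature map
    (Eqs (8)-(9)), concatenated into the vector Q (Eq (10)).
    The kernels here are fixed toy averaging filters, not learned weights."""
    d = len(S[0])
    Q = []
    for h in kernel_sizes:
        W = [[1.0 / (h * d)] * d for _ in range(h)]
        feats = conv_features(S, h, W, 0.0)
        Q.append(max(feats))            # max pooling over the feature map
    return Q
```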
3.3.2 Semantic association module.
LSTM is a variant of RNN that addresses issues related to long-range information loss, gradient vanishing, and exploding gradients. It is particularly well-suited for handling sequential information. LSTM analyzes input information using time sequences and incorporates forget, input, and output gates. The calculations for the forget gate ft, memory gate it, output gate ot, temporary memory state C̃t, current memory state Ct, and current hidden state ht are formulated as follows:
$$f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \tag{11}$$

$$i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \tag{12}$$

$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \tag{13}$$

$$\tilde{C}_t = \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \tag{14}$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{15}$$

$$h_t = o_t \odot \tanh\left(C_t\right) \tag{16}$$
where Wf, Uf, Wi, Ui, Wc, Uc, Wo, Uo are the respective weight matrices for the connections, σ denotes the sigmoid function, and bf, bi, bc, bo are bias vectors for the different stages.
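A single LSTM time step with scalar state (d = 1) makes the gate equations concrete; the weights below are toy values, not learned parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step (the standard gate equations) for scalar state.
    p holds the weights W*, U* and biases b* as plain floats."""
    f = sigmoid(p["Wf"] * x + p["Uf"] * h_prev + p["bf"])          # forget gate
    i = sigmoid(p["Wi"] * x + p["Ui"] * h_prev + p["bi"])          # input gate
    c_tilde = math.tanh(p["Wc"] * x + p["Uc"] * h_prev + p["bc"])  # candidate state
    c = f * c_prev + i * c_tilde                                   # memory state
    o = sigmoid(p["Wo"] * x + p["Uo"] * h_prev + p["bo"])          # output gate
    h = o * math.tanh(c)                                           # hidden state
    return h, c
```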
LSTM can only capture preceding information; it lacks integration of contextual information in the reverse direction, neglecting information from the posterior context. To achieve more comprehensive feature extraction of instruction vectors and improve result accuracy, bidirectional LSTM is employed to better capture contextual information from both directions, acquiring richer semantic features of binary code. Bidirectional LSTM, which is composed of two LSTM layers in opposite directions, merges the hidden states of the two LSTMs to obtain the output of bidirectional LSTM, enabling access to future contextual information. Bidirectional LSTM is utilized to extract contextual information from the local semantic module separately, yielding the forward feature sequence $\overrightarrow{h_w}$ and the backward feature sequence $\overleftarrow{h_w}$. The instruction feature vector hw output by bidirectional LSTM is obtained by concatenating the two:

$$h_w = \overrightarrow{h_w} \oplus \overleftarrow{h_w} \tag{17}$$
After bidirectional LSTM, a self-attention layer is added to calculate the weight coefficients for each instruction in the function. This mechanism learns the relationships between instructions within a function, grasps structural information in the code, and enhances the capture of long-range information in the instructions. Self-attention can be formulated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{18}$$
Finally, the function information undergoes processing through a fully connected layer and activation function, resulting in vector representations for the two binary functions.
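The self-attention aggregation can be sketched with Q = K = V = H, where H holds the bidirectional LSTM hidden states; this is a simplification in that the learned projection matrices are omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(H):
    """Scaled dot-product self-attention with Q = K = V = H (n x d):
    each output vector is a weighted mix of all hidden states,
    weights = softmax(QK^T / sqrt(d))."""
    n, d = len(H), len(H[0])
    out = []
    for i in range(n):
        scores = [sum(H[i][k] * H[j][k] for k in range(d)) / math.sqrt(d)
                  for j in range(n)]
        weights = softmax(scores)
        out.append([sum(weights[j] * H[j][k] for j in range(n))
                    for k in range(d)])
    return out
```

Each output row is a convex combination of the input hidden states, so the final function embedding reflects every instruction, weighted by relevance.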
3.4 Improved cosine similarity
A Siamese network is employed to analyze the similarity between two comparable entities, and a crucial component of this architecture is the distance function that represents the similarity. The performance of the model is significantly affected by the choice of the distance function. Methods like Gemini and SAFE measure the similarity between corresponding binary code by computing the cosine similarity between the vectors representing the two functions. Traditional cosine similarity primarily considers the direction of data vectors and does not account for values, making it unable to distinguish cases where the values of various dimensions in two vectors are proportionally scaled. To address this challenge, we improve traditional cosine similarity from the perspective of data values by proposing improved cosine similarity, which utilizes the Hellinger distance [38] and the difference in values taken by different data objects along the same dimension to enhance traditional cosine similarity, as illustrated in Fig 5. The calculation of improved cosine similarity is given by Eq (19):
(19)
where V = (v1, v2, ⋯, vm) and W = (w1, w2, ⋯, wm) are two m-dimensional data vectors.
Improved cosine similarity considers the data values in the similarity calculation, addressing the drawback of traditional cosine similarity’s insensitivity to numerical values. This method retains the advantages of traditional cosine similarity in directional discrimination, while also distinguishing data objects with proportionally scaled values in various dimensions. As a result, it achieves more accurate measurements of code similarity.
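Eq (19) is not reproduced above, so the sketch below shows only one plausible combination of cosine similarity with a Hellinger-style distance on the raw component values (components assumed non-negative); the exact form used by BinBcla may differ. It does, however, exhibit the stated property: a proportionally scaled copy of a vector no longer scores a perfect 1.

```python
import math

def cosine(v, w):
    """Traditional cosine similarity: direction only, blind to scale."""
    dot = sum(a * b for a, b in zip(v, w))
    nv = math.sqrt(sum(a * a for a in v))
    nw = math.sqrt(sum(b * b for b in w))
    return dot / (nv * nw)

def improved_cosine(v, w):
    """Illustrative stand-in for Eq (19): damp the cosine score by a
    Hellinger-style distance on the raw component values, so vectors
    that differ only by a proportional scaling are no longer identical.
    Assumes non-negative components (for the square roots)."""
    d = math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                      for a, b in zip(v, w))) / math.sqrt(2)
    return cosine(v, w) / (1.0 + d)
```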
3.5 Loss function
In this paper, pairs of similar and dissimilar code snippets are used as inputs to train the Siamese network, and cross-entropy is chosen as the loss function. For each input pair, the probability p is utilized to predict the similarity. The loss function is given by
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\frac{1+y_i}{2}\log p_i + \frac{1-y_i}{2}\log\left(1-p_i\right)\right] \tag{20}$$
where yi represents the label of sample i (similarity is denoted as 1 and dissimilarity as −1), and pi denotes the predicted probability of pair i being similar.
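With labels y ∈ {+1, −1} mapped to {1, 0} via (1 + y)/2, the cross-entropy loss can be sketched as follows (`bce_loss` is an illustrative helper, not a name from the paper):

```python
import math

def bce_loss(labels, probs, eps=1e-12):
    """Mean cross-entropy over pairs with labels y in {+1, -1};
    p is the predicted probability of similarity."""
    total = 0.0
    for y, p in zip(labels, probs):
        t = (1 + y) / 2                      # map {+1, -1} -> {1, 0}
        p = min(max(p, eps), 1.0 - eps)      # clamp for numerical safety
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(labels)
```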
3.6 Algorithm steps
- Step 1: Collect binary files for preprocessing. Extract the assembly instruction sequences of binary functions using disassembly tools, and input the instructions into the instruction embedding module. Utilize the instruction embedding module to embed the instruction sequences of the two input binary functions into a vector space, converting assembly instructions into real-valued vectors.
- Step 2: Input the obtained instruction vectors into the local semantic module. After undergoing convolutional operations with multiple different-sized kernels, extract key features and deeper structural information from the binary functions.
- Step 3: Perform max pooling on each feature map obtained from the local semantic module in the pooling layer, resulting in unified feature values. Obtain hidden vectors of fixed length.
- Step 4: Utilize the semantic association module to capture long-distance global feature information of binary code. Employ self-attention mechanism to calculate the weight coefficients of each instruction in the function, assigning different weights to instruction vectors, thus learning inter-function dependencies.
- Step 5: The function information undergoes processing through a fully connected layer and activation function, resulting in vector representations for the two binary functions.
- Step 6: Utilize an improved cosine similarity method to calculate the distance between the embedding vectors of two binary functions, outputting the similarity between the two binary functions.
4. Experimental setup
4.1 Datasets
To compare our model with existing methods, we created three binary code datasets covering different architectures, different compilers, and different optimization levels. We chose x64, x86, and ARM as the architectures, Clang and GCC as the compilers, and O0, O1, O2, and O3 as the optimization levels. The same source code was compiled into different binary files in the various compilation environments. After compilation, we used the ANGR binary analysis Python framework to disassemble the binary code, extract basic blocks, and label binary functions along with their names. In total, the cross-architecture dataset consisted of 128,396 functions, the cross-compiler dataset consisted of 102,682 functions, and the cross-optimization level dataset consisted of 203,738 functions.
4.2 Dataset processing
After the datasets were obtained, preprocessing was essential for the binary code similarity detection task. The objective of the task was to determine whether two functions are similar, requiring the generation of pairs of functions. Each function pair consisted of two binary functions and a label indicating whether they are similar. In this study, there were two types of pairs: similar function pairs and dissimilar function pairs. Similar function pairs were associated with a label of +1, while dissimilar function pairs were associated with a label of −1. The dataset was divided into a training set, validation set, and test set in an 8:1:1 ratio.
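An 8:1:1 split of the labelled function pairs can be sketched as follows; `split_dataset` is a hypothetical helper, not tooling from the paper.

```python
import random

def split_dataset(pairs, seed=42):
    """Shuffle labelled function pairs and split them 8:1:1 into
    train / validation / test sets."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])
```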
4.3 Hyperparameter selection
The experiments were conducted on the Windows 10 operating system, utilizing an NVIDIA RTX 3090 GPU and the PyCharm development platform.
Regarding the hyperparameters, in the pretraining phase for the assembly instruction representation method, as the average length of assembly instructions is smaller than the average length of sentences in natural language, the number of attention heads for instruction embedding module was reduced from 12 (as in BERT) to 8. The token embedding dimension was 256, and the other parameters remained unchanged. For local semantic module, window sizes of 3, 4, and 5 with a stride of 1 were used. The hyperparameter table for training the model is as follows in Table 1:
5. Evaluation
We selected Ins2vec [22], SAFE [1], and TCCCD [23] as baselines. The evaluations of different methods were conducted in three experiments involving different architectures, compilers, and optimization levels.
Ins2vec [22] is a straightforward neural network architecture based on bidirectional gate recurrent units. It employs an assembly language model utilizing length-ratio shuffle and skip-gram with negative sampling for instruction embedding, focusing solely on the semantic information of basic blocks. These embeddings are subsequently fed into a Siamese architecture to learn basic block embeddings.
SAFE [1] acts directly on disassembled binary functions without the need for manual feature extraction, enabling rapid generation of embeddings for hundreds of binary files via the Word2Vec model. This approach utilizes bidirectional gate recurrent units to capture the sequential interactions of instructions by considering both the instructions themselves and the context of the function. Furthermore, the Self-Attentive Network disregards noise and redundant information in the input, focusing on features crucial for the detection outcome.
TCCCD [23] initially represents code as abstract syntax trees (ASTs) and segments the ASTs into syntactic subtrees, thereby encapsulating the hierarchical structure and information of the code. Subsequently, in terms of neural network architecture, TCCCD employs the Encoder part of the Transformer to extract the global information of the code, and then utilizes convolutional neural networks to capture the local information of the code. Finally, it integrates features extracted from both networks to learn code vector representations imbued with lexical, syntactic, and structural information.
5.1 Different architectures
In this experiment, we utilized the same compiler, GCC-8.2.0, and the identical optimization level, O0. The x86, x64, and x86 function pools were used in turn as the target function pools, with the x64, ARM, and ARM function pools serving as the corresponding search pools. The experimental results for the three architecture pairs—x86 and x64, x64 and ARM, x86 and ARM—are presented in Tables 2–4.
In the three sets of experiments involving x86 and x64, x64 and ARM, x86 and ARM, BinBcla’s accuracy was 0.924, 0.917, and 0.922, the precision was 0.915, 0.893, and 0.895, and the F1 score was 0.912, 0.885, and 0.881, respectively. In the experiments, the average accuracy of Ins2vec, SAFE, and TCCCD were 0.801, 0.876, and 0.889, respectively. The average precision values were 0.768, 0.822, and 0.848, respectively. The average F1 score values were 0.778, 0.805, and 0.847, respectively. Our model achieved an average accuracy of 0.921, an average precision of 0.901 and an average F1 score of 0.893. Overall, BinBcla showed an improvement of 14.98%, 5.14%, and 3.60% in accuracy, 17.32%, 9.61%, and 6.25% in precision, and 14.78%, 10.93%, and 5.43% in F1 score compared with Ins2vec, SAFE, and TCCCD, respectively. The results indicate that in cross-architecture experiments, the proposed model demonstrates better resistance to cross-architecture differences and exhibits stronger robustness.
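The reported averages and relative improvements are straightforward to verify; note that the percentage gains follow from the rounded averages.

```python
def avg(xs):
    return sum(xs) / len(xs)

def rel_gain(ours, baseline):
    """Relative improvement, in percent, over a baseline score."""
    return (ours - baseline) / baseline * 100.0

# BinBcla per-pair scores from the cross-architecture experiments
acc = avg([0.924, 0.917, 0.922])   # average accuracy  -> 0.921
pre = avg([0.915, 0.893, 0.895])   # average precision -> 0.901
f1 = avg([0.912, 0.885, 0.881])    # average F1 score  -> 0.893
```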
5.2 Different compilers
In this experiment, the same architecture (x64) and identical optimization level (O0) were employed. Clang-4.0, GCC-5.5.0, and Clang-7.0 function pools were, respectively, used as the target function pools; and Clang-7.0, GCC-8.2.0, and GCC-8.2.0 function pools served as the target function search pools. In the domain of binary code similarity, varying versions of compilers are considered as different compilers. The results of the three experimental sets are presented in Tables 5–7.
It can be observed that in the experiments involving Clang-4.0 and Clang-7.0, GCC-5.5.0 and GCC-8.2.0, Clang-7.0 and GCC-8.2.0, BinBcla’s accuracy was 0.932, 0.947, and 0.943, respectively, with precision values of 0.903, 0.931, and 0.917 and F1 score values of 0.901, 0.925, and 0.908. In the experiments, the average accuracy of Ins2vec, SAFE, and TCCCD were 0.829, 0.898, and 0.916, respectively. The average precision values were 0.816, 0.854, and 0.875, respectively. The average F1 score values were 0.800, 0.865, and 0.877, respectively. Our model achieved an average accuracy of 0.941, an average precision of 0.917 and an average F1 score of 0.911. Overall, compared to Ins2vec, SAFE, and TCCCD, BinBcla exhibited improvements in accuracy of 13.51%, 4.79%, and 2.73%, in precision of 12.38%, 7.38%, and 4.80%, and in F1 score of 13.88%, 5.32%, and 3.88%, respectively, in experiments involving different compilers. These results indicate that in cross-compiler experiments, the proposed model demonstrates greater resistance to cross-compiler variations, achieving higher accuracy, precision and F1 score.
5.3 Different optimization levels
In this experiment, the dataset with cross-optimization levels was utilized, using the same x64 architecture and the Clang-7.0 compiler. The target function pools were formed using functions compiled with O0, O0, O0, O1, O1, and O2 optimization levels, while the search function pool consisted of functions compiled with O1, O2, O3, O2, O3, and O3 optimization levels. The results of six different optimization levels are presented in Tables 8–10.
In the experiments, BinBcla’s accuracy was 0.941, 0.933, 0.916, 0.934, 0.928, and 0.936, respectively. The precision values were 0.923, 0.904, 0.901, 0.929, 0.907, and 0.913, respectively. The F1 score values were 0.920, 0.895, 0.893, 0.917, 0.892, and 0.915, respectively. In the experiments, the average accuracy of Ins2vec, SAFE, and TCCCD was 0.818, 0.887, and 0.903, respectively. The average precision values were 0.725, 0.807, and 0.849, respectively. The average F1 score values were 0.718, 0.799, and 0.863, respectively. Our model achieved an average accuracy of 0.931, an average precision of 0.913, and an average F1 score of 0.905. Overall, BinBcla demonstrated improvements in accuracy of 13.81%, 4.96%, and 3.10% relative to Ins2vec, SAFE, and TCCCD, improvements in precision of 25.93%, 13.14%, and 7.54%, and improvements in F1 score of 26.04%, 13.27%, and 4.87%, respectively. In all six sets of experiments, our model achieved the highest accuracy, precision, and F1 score among all the models, indicating that in cross-optimization level experiments, our model is more robust to cross-optimization level differences and exhibits superior performance.
5.4 Ablation studies
In the ablation experiments, we investigated several factors influencing the detection performance of the model, including the embedding model, local semantic module, and the distance function.
In the ablation experiments, the x64 architecture was retained, and the compilation optimization level was O0. We used Clang-7.0’s function pool as the target function pool and GCC-8.2.0’s function pool as the search pool. In Model 1, the instruction embedding module was replaced with Word2Vec and the other components were unchanged. In Model 2, the local semantic module was replaced with a CNN and the other components were unchanged. In Model 3, the improved cosine similarity was replaced with traditional cosine similarity and the other components were unchanged. In Model 4, the local semantic module was removed and the other components were unchanged. Model 5 represents our proposed model. The results are shown in Table 11.
- In the experiments evaluating the embedding models, keeping the other components unchanged, we only varied the embedding model. The accuracy using the Word2Vec model was 0.786, the precision was 0.728, and the F1 score was 0.742. In contrast, our model, utilizing the instruction embedding module, achieved an accuracy of 0.943, a precision of 0.917, and an F1 score of 0.908. This suggests that the instruction embedding module can dynamically adjust token vectors based on different contextual information, addressing the issue of polysemy that Word2Vec struggles with. This adaptation capability contributes to improved accuracy, precision, and F1 score.
- In the experiments evaluating the local semantic module, while keeping the other components unchanged, we replaced the local semantic module with a standard CNN. The model using the CNN achieved an accuracy of 0.925, a precision of 0.906, and an F1 score of 0.890. Our model, incorporating the local semantic module, achieved an accuracy of 0.943, a precision of 0.917, and an F1 score of 0.908. This validates that the local semantic module, by not only focusing on local information features but also incorporating preceding information, contributes to enhancing the model’s performance. When the local semantic module was removed and the other components were unchanged, the model achieved an accuracy of 0.916, a precision of 0.894, and an F1 score of 0.891. This validates that multi-level semantic feature extraction can obtain more accurate code semantic information.
- In the experiments evaluating the distance function, while keeping the other components unchanged, we replaced the improved cosine similarity with traditional cosine similarity. The model using traditional cosine similarity achieved an accuracy of 0.912, a precision of 0.873, and an F1 score of 0.875. Our model, employing the improved cosine similarity, achieved an accuracy of 0.943, a precision of 0.917, and an F1 score of 0.908. This indicates that the improved cosine similarity better captures the relationships between all elements in the distance vector, fundamentally enhancing the model’s robustness to new sample functions, making complex inferences feasible, and improving overall performance.
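The distance-function ablation above contrasts a fixed angular score with one that learns over the elements of the distance vector. A minimal sketch of this idea (our illustrative formulation, not the authors’ exact method): keep the normalized element-wise product of the two embeddings as a vector, and let a learned weight vector decide how much each dimension contributes, instead of summing all dimensions equally as traditional cosine similarity does.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_similarity(u, v):
    """Traditional cosine similarity: one scalar from the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def learned_cosine_similarity(u, v, w):
    """Learned variant: weight each dimension's contribution individually.

    The per-element contribution vector d plays the role of the distance
    vector; w would be trained jointly with the rest of the model.
    """
    d = (u * v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(w @ d)

dim = 8
u, v = rng.standard_normal(dim), rng.standard_normal(dim)

# With uniform weights the learned form reduces exactly to plain cosine
# similarity, so the learned version is a strict generalization.
w_uniform = np.ones(dim)
assert np.isclose(learned_cosine_similarity(u, v, w_uniform),
                  cosine_similarity(u, v))
```

Because the uniform-weight case recovers ordinary cosine similarity, training can only move the weights away from uniformity when doing so helps separate similar from dissimilar function pairs.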
5.5 Hyperparameter experiment
This section discusses the impact of hyperparameters on BinBcla’s performance, focusing on the number of epochs and the dimensions of function embeddings. In the hyperparameter experiments, we maintained a consistent architecture (x64) and optimization level (O0). The target function pool was derived from functions compiled with Clang-7.0, and the function pool for searching was based on functions compiled with GCC-8.2.0.
- The number of epochs is crucial in training the model. If the number is too low, the model may not have sufficient training time. However, if the number is too high, the model might face overfitting. To determine the optimal number of epochs, we trained the model for 100 epochs and evaluated its performance. As depicted in Fig 6, our model’s accuracy, precision, and F1 score became relatively stable around the 60th epoch. Moreover, after stability was achieved, there was no degradation in the performance metrics as the number of epochs continued to increase, indicating that our model exhibits good robustness.
- The dimension of function embeddings is another critical hyperparameter that we examined. We conducted tests using different embedding dimensions. Intuitively, an embedding vector with a lower dimension contains relatively less semantic information, potentially leading to decreased model performance. As illustrated in Fig 7, when the embedding dimension was set to 32, the accuracy, precision, and F1 score of the model were relatively low. As the embedding dimension increased, the accuracy, precision, and F1 score continued to rise. However, when the embedding dimension surpassed 256, there was almost no further increase in accuracy, precision, or F1 score. Additionally, an excessively large embedding dimension results in higher memory usage and longer training time. Therefore, considering the balance between effectiveness and resource constraints, we selected 256 as the optimal embedding dimension.
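The memory cost mentioned in the last bullet grows linearly with the embedding dimension. A back-of-the-envelope sketch (the pool size of one million functions and float32 storage are our assumptions, not figures from the paper):

```python
def pool_memory_mib(num_functions, dim, bytes_per_value=4):
    """Memory needed to hold one float32 embedding per function, in MiB."""
    return num_functions * dim * bytes_per_value / 2**20

# Hypothetical pool of 1,000,000 binary functions: doubling the embedding
# dimension doubles the storage for the whole function pool.
for dim in (32, 128, 256, 512):
    print(dim, round(pool_memory_mib(1_000_000, dim), 1))
# the dim=256 line prints: 256 976.6
```

At dimension 256 such a pool needs just under 1 GiB, while dimension 512 would double that for, per Fig 7, essentially no metric gain, which illustrates why 256 is a reasonable operating point.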
5.6 Limitations
- Function Inlining: The purpose of function inlining is to reduce the runtime overhead incurred when entering and exiting functions. However, it can result in significant changes to the assembly program, posing challenges for labeling based on function names, as binary functions with similar semantics may be classified as dissimilar. In the future, we will further investigate this issue.
- Code Obfuscation: Our model primarily focuses on analyzing the similarity of binary code to which no code obfuscation techniques have been applied. This means that our model cannot be directly applied to binary code that has undergone obfuscation. In the future, we plan to conduct research in this area to expand the capabilities of our model.
6. Conclusion
In this paper, we proposed a novel binary code similarity detection model called BinBcla. Firstly, considering the differences between binary code and natural language, and to further enhance code representation, we developed an instruction embedding module with a newly designed training task, based on an enhanced BERT architecture, to provide dynamic semantic embeddings for binary code instructions. It dynamically adjusts token embeddings based on varying contextual information, addressing the challenge of polysemy that Word2Vec struggles with. Secondly, to address the insufficient semantic information extraction of a single neural network model, we proposed a multi-feature fusion technique that performs feature extraction on binary code at different levels, greatly enriching its semantic feature information. Finally, considering that traditional cosine similarity cannot distinguish data objects whose values change proportionally across dimensions, we proposed an improved cosine similarity method that learns from the distance vectors, capturing relationships among all of their elements. Through experimental evaluation, our proposed method demonstrated superior performance across different architectures, compilers, and optimization levels compared with previous models. This result demonstrates the effectiveness and advancement of the proposed method.
In the future, we plan to integrate graph structure-aware embedding with semantic information to further enhance the model’s performance. Additionally, we will conduct experiments related to code obfuscation detection.
References
- 1.
Massarelli L, Di Luna GA, Petroni F, et al. Safe: Self-attentive function embeddings for binary similarity. Proceedings of Detection of Intrusions and Malware, and Vulnerability Assessment: 16th International Conference, 2019. p. 309–329.
- 2.
Ding SHH, Fung BCM, Charland P. Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. Proceedings of IEEE Symposium on Security and Privacy, 2019. SP; 2019. p. 472–489.
- 3.
Li X, Qu Y, Yin H. Palmtree: Learning an assembly language model for instruction embedding. Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021. p. 3236–3251.
- 4.
Gu Y, Shu H, Hu F. UniASM: Binary code similarity detection without fine-tuning. arXiv:2211.01144. 2023.
- 5.
Feng Q, Zhou R, Xu C, et al. Scalable graph-based bug search for firmware images. Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2016. p. 480–491.
- 6.
Pewny J, Garmany B, Gawlik R, et al. Cross-architecture bug search in binary executables. Proceedings of IEEE Symposium on Security and Privacy, 2015. p. 709–724.
- 7. David Y, Partush N, Yahav E. Statistical similarity of binaries. ACM SIGPLAN Not. 2016;51:266–280.
- 8.
Ming J, Xu D, Jiang Y, et al. BinSim: Trace-based semantic binary diffing via system call sliced segment equivalence checking. Proceedings of the 26th USENIX Security Symposium, 2017. p. 253–270.
- 9.
Chandramohan M, Xue Y, Xu Z, et al. Bingo: Cross-architecture cross-OS binary search. Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2016. p. 678–689.
- 10.
Egele M, Woo M, Chapman P, et al. Blanket execution: Dynamic similarity testing for program binaries and components. Proceedings of 23rd USENIX Security Symposium, 2014. p. 303–317.
- 11.
Luo L, Ming J, Wu D, et al. Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection. Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014. p. 389–400.
- 12.
Nouh L, Rahimian A, Mouheb D, et al. Binsign: Fingerprinting binary functions to support automated analysis of code executables. Proceedings of ICT Systems Security and Privacy Protection: 32nd IFIP TC 11 International Conference, 2017. p. 341–355.
- 13.
Ding SHH, Fung BCM, Charland P. Kam1n0: Mapreduce-based assembly clone search for reverse engineering. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. p. 461–470.
- 14.
Huang H, Youssef AM, Debbabi M. Binsequence: Fast, accurate and scalable binary code reuse detection. Proceedings of the ACM on Asia Conference on Computer and Communications Security, 2017. p. 155–166.
- 15.
Feng Q, Wang M, Zhang M, et al. Extracting conditional formulas for cross-platform bug search. Proceedings of the ACM on Asia Conference on Computer and Communications Security, 2017. p. 346–359.
- 16.
Eschweiler S, Yakdan K, Gerhards-Padilla E. DiscovRE: Efficient cross-architecture identification of bugs in binary code. Proceedings of Network and Distributed Systems Security (NDSS) Symposium, 2016. p. 58–79.
- 17.
Pewny J, Schuster F, Bernhard L, et al. Leveraging semantic signatures for bug search in binary programs. Proceedings of the 30th Annual Computer Security Applications Conference, 2014. p. 406–415.
- 18. Wu S, Gao X, He H. Topic detection algorithm based on bilateral cosine similarity. Oper Res Manag Sci. 2021;30:75–83.
- 19.
Xu X, Liu C, Feng Q, et al. Neural network-based graph embedding for cross-platform binary code similarity detection. Proceedings of the ACM SIGSAC conference on computer and communications security, 2017. p. 363–376.
- 20.
Zuo F, Li X, Young P, et al. Neural machine translation inspired binary code similarity comparison beyond function pairs. Proceedings of network and distributed systems security (NDSS) Symposium. 2019. p. 51–68.
- 21.
Redmond K, Luo L, Zeng Q. A cross-architecture instruction embedding model for natural language processing-inspired binary code analysis. arXiv:1812.09652. 2018.
- 22.
Li W, Jin S. A simple function embedding approach for binary similarity detection. Proceedings of the 2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking, 2020. p. 570–577.
- 23. Ben K, Yang J, Zhang X, et al. Code clone detection based on transformer and convolutional neural network. J Zhengzhou Univ (Eng Sci). 2023;44:12–18.
- 24.
Liu B, Huo W, Zhang C, et al. αdiff: Cross-version binary code similarity detection with DNN. Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018. p. 667–678.
- 25. Wan Z, Wang F, Huang S. Chinese news classification based on weighted word vector and improved TextCNN. Softw Guide. 2023;22:59–64.
- 26.
Duan Y, Li X, Wang J, et al. Deepbindiff: Learning program-wide code representations for binary diffing. Proceedings of Network and distributed system security symposium, 2020. p. 1–16.
- 27. Yang J, Fu C, Liu XY, Yin H, Zhou P. Codee: A tensor embedding scheme for binary code search. IEEE Trans Softw Eng. 2021;48:2224–2244.
- 28.
Massarelli L, Di Luna GA, Petroni F, et al. Investigating graph embedding neural networks with unsupervised features extraction for binary analysis. Proceedings of the 2nd Workshop on Binary Analysis Research (BAR), 2019. p. 1–11.
- 29.
Yu Z, Cao R, Tang Q, et al. Order matters: Semantic-aware neural networks for binary code similarity detection. Proceedings of the AAAI conference on artificial intelligence, 2020. p. 1145–1152.
- 30. Raza A, Uddin J, Almuhaimeed A, et al. AIPs-SnTCN: Predicting anti-inflammatory peptides using fastText and transformer encoder-based hybrid word embedding with self-normalized temporal convolutional networks. J Chem Inf Model. 2023;63:6537–6554. pmid:37905969
- 31. Akbar S, Raza A, Al Shloul T, et al. pAtbP-EnC: Identifying anti-tubercular peptides using multi-feature representation and genetic algorithm based deep ensemble model. IEEE Access. 2023.
- 32. Akbar S, Raza A, Zou Q. Deepstacked-AVPs: Predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model. BMC Bioinformatics. 2024;25:102. pmid:38454333
- 33. Akbar S, Hayat M, Tahir M, et al. cACP-DeepGram: Classification of anticancer peptides via deep neural network and skip-gram-based word embedding model. Artif Intell Med. 2022;131:102349. pmid:36100346
- 34. Akbar S, Khan S, Ali F, et al. iHBP-DeepPSSM: Identifying hormone binding proteins using PsePSSM based evolutionary features and deep learning approach. Chemometr Intell Lab Syst. 2020;204:104103.
- 35. Akbar S, Zou Q, Raza A, et al. iAFPs-Mv-BiTCN: Predicting antifungal peptides using self-attention transformer embedding and transform evolutionary based multi-view features with bidirectional temporal convolutional networks. Artif Intell Med. 2024;151:102860. pmid:38552379
- 36. Inayat N, Khan M, Iqbal N, et al. iEnhancer-DHF: Identification of enhancers and their strengths using optimize deep neural network with multiple features extraction methods. IEEE Access. 2021;9:40783–40796.
- 37. Khan F, Khan M, Iqbal N, et al. Prediction of recombination spots using novel hybrid feature extraction method via deep learning approach. Front Genet. 2020;11:539227. pmid:33093842
- 38. Sohangir S, Wang D. Improved sqrt-cosine similarity measurement. J Big Data. 2017;4:1–13.