Abstract
Ordinary users encounter various documents on the network every day, such as news articles, emails, and messages, and most users are vulnerable to malicious attacks. Malicious attack methods continue to evolve, making neural network-based malware detection increasingly appealing to both academia and industry. Recent studies have leveraged byte sequences within files to detect malicious activities, primarily using convolutional neural networks to capture local patterns in the byte sequences. Meanwhile, in natural language processing, Transformer-based language models have demonstrated superior performance across various tasks and have been applied to other domains, such as image analysis and speech recognition. In this paper, we introduce a novel Transformer-based language model for malware detection that processes byte sequences as input. We propose two new pre-training strategies, real-or-fake prediction and same-sequence prediction, and explore all possible combinations of these with conventional strategies such as masked language modeling and next-sentence prediction. By compiling existing byte sequences for malware detection, we construct a benchmark consisting of three file types (PDF, HWP, and MS Office) for pre-training and fine-tuning. Our empirical results demonstrate that our language model outperforms convolutional neural networks in the malware detection task, achieving a macro F1 score improvement of approximately 2.7%p∼11.1%p. We believe our language model will serve as a foundation model for malware detection services, and we will extend our research to develop a more powerful encoder-based model that can process longer byte sequences.
Citation: Kim E-J, Lee Y-K, Lee S-M, Kim J-N, Kang AR, Kim M-s, et al. (2025) Malware detection using pre-trained transformer encoder with byte sequences. PLoS One 20(10): e0332307. https://doi.org/10.1371/journal.pone.0332307
Editor: Daniel Ioan Hunyadi, Lucian Blaga University of Sibiu, Romania
Received: April 1, 2025; Accepted: August 22, 2025; Published: October 13, 2025
Copyright: © 2025 Kim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The relevant data are available in a public repository at the following link: https://doi.org/10.6084/m9.figshare.30127783.v1.
Funding: This work was supported by the Korea Research Institute for Defense Technology Planning and Advancement (KRIT) - Grant funded by Defense Acquisition Program Administration (DAPA) (KRIT-CT-21-037). This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2020-NR049604). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Online users are vulnerable to malicious programs or files, especially non-executables such as Microsoft Excel, Word, PowerPoint, and Hangul documents, where Hangul is a word-processing program widely used in South Korea. When people find such non-executable attachments in their mailbox, they are often not very cautious when opening the documents. This may harm their systems or cause severe damage to important files (e.g., customer databases), so it is necessary to develop a method that analyzes and detects malicious non-executables. There are mainly two ways to perform malware analysis: static and dynamic. The dynamic method involves a separate platform or an isolated virtual environment and examines the step-by-step actions of a suspicious program. Existing studies of this method have a limitation in that they are not reproducible across different emulation environments. On the other hand, the static method analyzes the suspicious file without running it, making it preferable as the number of files on online platforms continues to increase.
Recently, studies have exploited byte sequences within non-executables for the task of malware detection. These studies assume that there are sequential patterns underlying the byte sequences within the files, since the files contain semantic content following particular formats (e.g., portable document format, docx). Their general process is depicted in Fig 1. Suppose that the file ‘A’ is a non-executable containing multiple byte sequences, and the set of sub-sequence samples can be obtained in a sliding-window manner. The detection model at the bottom of Fig 1 takes the t-th sample st as input, where st belongs to the set of sampled sub-sequences, and gives the prediction output ot. When we have ot = malware for any t, the file ‘A’ is considered suspicious, and we can take some action (e.g., deletion) on the file. This paper proposes a new detection model that is pre-trained with a large amount of byte sequences, and we demonstrate its superior performance compared to previous detection models.
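To make this pipeline concrete, the following Python sketch implements the file-level decision rule under simple assumptions: a fixed 512-byte window, a non-overlapping stride, and a placeholder predict_sample callable standing in for the detection model. The names and the stride are illustrative, not the exact implementation used in this paper.

# Sketch of the file-level detection pipeline in Fig 1 (illustrative only).
# `predict_sample` stands in for any detection model that labels one sub-sequence.
from typing import Callable, List

def extract_samples(data: bytes, window: int = 512, stride: int = 512) -> List[bytes]:
    """Slice a byte sequence into fixed-length samples in a sliding-window manner."""
    return [data[i:i + window] for i in range(0, max(len(data) - window + 1, 1), stride)]

def is_file_suspicious(data: bytes, predict_sample: Callable[[bytes], str]) -> bool:
    """A file is flagged as suspicious if any sample s_t yields o_t == 'malware'."""
    return any(predict_sample(s) == "malware" for s in extract_samples(data))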
Although previous studies on malware detection using byte sequences have shown promising results, this line of work has common issues. First, only a few datasets are publicly available (e.g., the MS Office dataset [1]). One may argue that we may borrow public datasets (e.g., VirusShare). However, not all document-type files can be used for training because the objects within the file must have labels indicating which parts correspond to malicious actions. Annotating these labels requires expert input and is costly, making publicly available annotated datasets scarce. The lack of datasets makes it difficult to reproduce and compare existing methods. Second, there are only a few Transformer-based studies. Since the Transformer was introduced [2] in the field of natural language processing (NLP), many variants and language models have emerged, such as Bidirectional Encoder Representations from Transformers (BERT) [3] and the Generative Pre-trained Transformer (GPT) series [4–6]. Although Transformer-based methods have shown successful performance in many tasks not only in the NLP field but also in image analysis, speech analysis, and other areas, there have been only a few Transformer-based studies for malware detection using byte sequences.
Some previous studies have adopted Graph Neural Networks (GNNs) [7,8], or combined Convolutional Neural Networks (CNNs) with Recurrent Neural Networks (RNNs) [9], to tackle different problems, and CNNs have often been adopted as malware detection models because CNN-based models are known to be effective in capturing local patterns [1,10,11]. These studies assumed that there exist useful local patterns in the byte sequences, and convolution operations were used to capture them. Even though the byte sequences are often compressed or encrypted, these studies revealed that CNN-based models work effectively by capturing the local patterns in the byte sequences. For long byte sequences, they proposed a shallow-wide architecture or a stack of convolutions.
Since the Transformer [2] appeared, its variants have achieved success not only in the natural language processing (NLP) area, but also in malware detection and classification. The most well-known Transformer variants are the BERT and GPT series, where BERT and GPT correspond to the encoder and decoder parts of the Transformer, respectively. A few studies have employed GPT (the Transformer decoder) for malware detection tasks. Nazenin Şahin [12] proposed a method that feeds the assembly code obtained from a static analysis of PE files to the GPT-2 language model [5]. In Deniz Demirci et al. [13], assembly instructions are used as input for a stacked long short-term memory (LSTM) [14] and the GPT-2 language model. While the above studies adopted the Transformer decoder, more studies have employed BERT (the Transformer encoder), since BERT is known to be more efficient and powerful in natural language understanding (NLU) tasks, whereas the GPT series is effective in natural language generation (NLG) tasks. The malware detection and classification tasks are NLU-related tasks, making the BERT-based approach preferable. For example, MalBERT is a BERT-based approach that takes the source code of Android applications as input and generates distributed representations used for malware classification [15]. Its authors showed through experimental results that MalBERT was superior to LSTMs. In Ferhat Demirkıran et al. [16], a pre-trained CANINE encoder [17] was borrowed for API call sequence analysis, and their method achieved state-of-the-art performance in malware classification. MalBERTv2 [18] is an extension of their prior work (i.e., MalBERT), incorporating new pre-tokenization and a complete pipeline for malware analysis. All previous studies using the Transformer architecture have indicated that pre-training with a large amount of data allows the model to learn general knowledge, so that the model achieves higher performance in target tasks by better understanding the underlying patterns of the tasks.
In this paper, we treat the byte sequences within non-executables as character sequences within documents, and find an answer to a question: “What if a Transformer-based model is pre-trained with a large amount of byte sequences? Will it work for malware detection task?” We designed and constructed a new Transformer encoder-based pre-trained model that learns the representation of byte sequences within files, namely ByteEnc. As far as we know, this is the first study to pre-train an encoder-based language model using only byte sequences. Furthermore, we design new pre-training strategies and investigate their impact through experimental results on the malware detection task. Our contributions can be summarized as follows:
- Pre-training algorithm: We design new pre-training algorithms and examine possible combinations with other pre-training algorithms to improve task performance.
- Pre-trained model: We introduce our Transformer encoder-based pre-trained model, namely ByteEnc. As far as we know, this is the first encoder-based model that is pre-trained exclusively with byte sequences of non-executable documents. We demonstrate the effectiveness of the pre-trained model by performing experiments on three file formats (i.e., Portable Document Format (PDF), Hangul Word Processor (HWP), and Microsoft (MS) Office).
Materials and methods
Materials
Although there have been studies on malware detection using byte sequences, most of them did not disclose their datasets, and only a few public datasets are available, as follows. In Young-Seob Jeong et al. [11], byte sequences of PDF documents were provided as files of comma-separated values (CSV) format. This dataset is downloadable from a public website (https://sites.google.com/view/datasets-for-public/). A new CNN-based design was proposed for malware detection in Hangul Word Processor (HWP) documents as a defensive measure to protect governmental institutes of South Korea from North Korea, and their HWP dataset is available upon request [19]. Byte sequences of Microsoft (MS) Office documents (e.g., MS Word, MS PowerPoint) were provided [1]. We utilize all of the above datasets and convert them into a uniform format for convenient use. Following previous studies [1,10,20], we extract sample sequences (i.e., sub-sequences) of a predetermined length, as the existing datasets contain byte sequences of varying lengths. Specifically, given the original byte sequence sO in Fig 2, sub-sequences were randomly extracted, with a maximum of 100 sub-sequences from the same byte sequence. The extracted samples are 512 bytes in length.
From the available byte-sequence datasets above, we constructed two different datasets for experiments: (1) a task dataset for malware detection and (2) a pre-training dataset. The pre-training dataset was used only for the pre-training phase, and the task dataset was used to fine-tune the pre-trained LM for malware detection. A summary of the datasets is described in Table 1. A raw file contains one or more byte sequences, and each byte sequence is annotated with a label from {malware, normal}. Since the raw file provides the exact offset (i.e., position) of a malicious action, the malware sub-sequences are sampled to ensure they contain the offset. For convenience, we rearrange the raw files of different formats to have the same template. Furthermore, we also provide the sampled sub-sequences in a format readable by the Transformers library. We checked the licenses of all data sources and confirmed that it is allowed to disclose our reformed dataset.
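As an illustration of this sampling procedure, the sketch below draws up to 100 random 512-byte sub-sequences from one byte sequence and, for malware sequences, forces each window to cover the annotated offset. The function name and the handling of short sequences are our own assumptions rather than the exact preprocessing code.

# Illustrative sampling of labeled 512-byte sub-sequences from one byte sequence.
# The cap of 100 samples per sequence follows the text; names are our own.
import random
from typing import List, Optional

def sample_subsequences(seq: bytes, malicious_offset: Optional[int],
                        length: int = 512, max_samples: int = 100) -> List[bytes]:
    samples = []
    for _ in range(min(max_samples, max(len(seq) - length + 1, 1))):
        if malicious_offset is None:                      # 'normal' sequence: any window
            start = random.randint(0, max(len(seq) - length, 0))
        else:                                             # 'malware': window must cover the offset
            lo = max(0, malicious_offset - length + 1)
            hi = min(malicious_offset, max(len(seq) - length, 0))
            start = random.randint(lo, max(lo, hi))
        samples.append(seq[start:start + length])
    return samples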
ByteEnc
The architecture of our proposed language model, ByteEnc, is the same as that of BERT; ByteEnc has stacked Transformer encoders. However, it is not trivial to make ByteEnc work properly with byte streams because of two issues: tokenization and positional encoding.
Tokenization.
The first issue is tokenization, which arises from the different nature of byte sequences compared to natural language texts. The Transformer and its variants were originally designed for natural language processing, but byte sequences differ significantly from natural language sentences. For example, byte sequences are often much longer than sentences; as reported in Young-Seob Jeong et al. [21], HWP byte sequences range from 350K to 710K bytes. It is not trivial to take such extremely long sequences as input, so many previous studies sliced the original sequences into smaller segments (i.e., sub-sequences) using a sliding-window technique.
The previous studies employing the sub-sequence sampling applied simple byte-level tokenization followed by an embedding layer that maps each byte into an embedding space. The reason for utilizing byte-level tokenization is that operational boundary-based tokenization is vulnerable to the out-of-vocabulary (OOV) problem. That is, when constructing a vocabulary consisting of operational keywords (e.g., goto, replace, ...), new keywords may appear in byte sequences due to program version updates or the emergence of new file types. Such OOV keywords, which are not present in the vocabulary, can significantly degrade the generalizability of models. Another reason is that the sub-sequences are not consistent with operation boundaries. Consider the two sub-sequences s2 and s3 at the top of Fig 2. They begin in the middle of an operation (e.g., jump), and such inconsistency with the operation boundary may cause poor tokenization if NLP tokenizers (e.g., byte-pair encoding (BPE) tokenizer) are used. The byte-level tokenization alleviates this problem by simply converting a sub-sequence into a sequence of bytes, defining each byte as a token. However, the byte-level tokenization may lose arbitrary semantic patterns (e.g., operation keywords) within the sub-sequences. To systematically examine this, we formulate the previous byte-level tokenization as a special type (i.e., n=1) of byte-level n-gram tokenization. The bottom of Fig 2 describes how the byte-level n-grams work. For example, the sub-sequence s3 will be converted into a sequence of four tokens when n=1, three tokens when n=2, and two tokens when n=3. Empirical results with different settings of n in the Results and discussion section will show that the best performance is achieved when n=1.
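The sketch below illustrates byte-level n-gram tokenization with a sliding stride of one byte; representing each token by its hexadecimal string is our own choice for readability, not necessarily how ByteEnc builds its vocabulary.

# Byte-level n-gram tokenization; n = 1 reduces to plain byte-level tokenization.
def byte_ngram_tokens(sub_seq: bytes, n: int = 1) -> list:
    """Sliding byte-level n-grams: a 4-byte sub-sequence yields 4 tokens for n=1,
    3 tokens for n=2, and 2 tokens for n=3, as in Fig 2."""
    return [sub_seq[i:i + n].hex() for i in range(len(sub_seq) - n + 1)]

# e.g., byte_ngram_tokens(b"\x4a\x50\x00\xff", n=2) -> ['4a50', '5000', '00ff']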
Positional encoding.
The second issue is positional encoding. Language models in the NLP field, such as BERT variants and the GPT series, rely on specific positional encodings. There are mainly two types of positional encoding algorithms: absolute and relative. The most well-known absolute positional encoding is the sinusoidal positional encoding of the Transformer [2], while a few widely used algorithms utilize relative positional information: relative positional embedding [22], ALiBi [23], and rotary positional embedding [24]. Applying absolute positional encodings to byte sequences is not appropriate because byte sequences differ from natural language texts. As stated in Young-Seob Jeong et al. [19], a malicious action within a byte sequence may consist of small fractions scattered throughout the sequence. Suppose a set S of sub-sequences is sampled from a long byte sequence with randomly scattered fractions of a malicious action. In that case, the absolute positional information within every sub-sequence will not be informative. Therefore, instead of absolute positional encoding, we use relative positional encoding; specifically, we adopt the relative positional embedding (RePE) [22] in this paper.
One may argue that the RePE is not useful because malicious actions may appear at random positions within each sub-sequence. To investigate this, we also try using no positional encoding at all, which corresponds to the no-positional-encoding (NoPE) setting [25]. Experimental results will show that the RePE gives better performance than the NoPE.
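For reference, the following minimal single-head sketch shows how Shaw-style relative positional embeddings [22] enter the attention logits; the clipping distance, module names, and the omission of multi-head and masking details are simplifications rather than ByteEnc's exact implementation.

# Minimal single-head sketch of relative positional embeddings (RePE) in self-attention.
import torch
import torch.nn as nn

class RelativeSelfAttention(nn.Module):
    def __init__(self, d_model: int, max_rel_dist: int = 16):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.max_rel_dist = max_rel_dist
        # one embedding per clipped relative distance in [-k, k]
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, d = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # relative distances clipped to [-k, k], shifted to [0, 2k] for the lookup
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        a_k = self.rel_emb(rel + self.max_rel_dist)            # (n, n, d)
        # content-content plus content-position attention logits
        logits = q @ k.transpose(-1, -2)                        # (b, n, n)
        logits = logits + torch.einsum("bnd,nmd->bnm", q, a_k)
        attn = torch.softmax(logits / d ** 0.5, dim=-1)
        return attn @ v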
Pre-training strategy
There have been several pre-training (PT) algorithms developed so far. Two representative PT algorithms for Transformer encoder-based models are masked-language modeling (MLM) and next-sentence prediction (NSP) [3]. We adopt these algorithms but also propose two new algorithms: Real-or-Fake Prediction (RFP) and Same-Sequence Prediction (SSP), which are designed according to the distinct characteristics of byte sequences.
Masked-language modeling.
The most well-known PT algorithm is masked-language modeling (MLM), which is used for pre-training BERT models [3]. The MLM algorithm introduces noise to the input sequence by replacing some original tokens with ‘mask’ tokens or random tokens. This enables the language model (LM) to comprehend the context of the given sequence, which, in turn, enhances its performance on natural language understanding (NLU) tasks. We adopt the MLM algorithm with a masking probability of pmlm = 0.15. For more details of the MLM algorithm, please refer to Jacob Devlin et al. [3].
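A minimal sketch of this masking step on byte-token ids is given below; the 80/10/10 split between mask, random, and unchanged tokens follows BERT's default and is assumed here, as are the special-token id and vocabulary size.

# MLM corruption on byte-token ids with p_mlm = 0.15 (illustrative sketch).
import random

MASK_ID = 4          # hypothetical id of the mask token
VOCAB_SIZE = 261     # hypothetical: 256 byte values + special tokens

def apply_mlm(token_ids, p_mlm: float = 0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100: ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < p_mlm:
            labels[i] = tok                                     # the model must recover this byte
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                             # replace with the mask token
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)        # replace with a random token
            # else: keep the original token
    return inputs, labels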
Real-or-fake prediction.
In Kevin Clark et al. [26], the replaced token detection (RTD) algorithm was proposed to train the discriminator component of ELECTRA. The discriminator takes a token sequence, part of which is corrupted (i.e., replaced by tokens sampled from the generator), and is tasked with predicting which tokens match the original, uncorrupted sequence. Motivated by the RTD algorithm, we design a new algorithm called Real-or-Fake Prediction (RFP), which can be seen as a special type of token corruption prediction (TCP) [27].
The RFP algorithm introduces noise to part of the input sequence by replacing the original tokens with randomly chosen fake tokens. The algorithm has two probabilities: prfp1 and prfp2. It first selects candidate tokens with the probability prfp1, and each candidate token is replaced by a random (fake) token with the probability prfp2; thus, with prfp2 = 0.5, roughly half of the candidate tokens will be fake and the other half will remain true tokens. This requires the language model (LM) to predict whether each token within the sequence is ‘real’ (uncorrupted) or ‘fake’ (corrupted), essentially making it a token-level binary classification task. As depicted in Fig 3, ByteEnc takes the two sequences s1 and s2 as input, where <CLS> and <SEP> are a classification token and a separation token, respectively. A part of the two sequences is selected as candidate bytes, and ByteEnc predicts whether these bytes are fake or not.
The biggest difference between the RFP and MLM algorithms is that RFP is a token-level binary classification, whereas MLM is a token-level multi-class classification. Since byte sequences may contain randomly scattered fractions of malicious actions and are often compressed or encrypted, we believe that the MLM algorithm is too challenging, as it requires the LM to predict the correct bytes among compressed, encrypted, and randomly scattered bytes. In contrast, the RFP algorithm is easier than the MLM algorithm because it is a binary classification task, and we assume it provides a more suitable level of difficulty for the pre-training objective.
Next-sentence prediction.
The next-sentence prediction (NSP) algorithm is a sequence-level binary classification; given a sequence pair (si, sj), it predicts whether sj is the next sequence following si. With a probability pnsp = 0.5, sj will be the next sequence, i.e., j = i + 1; otherwise, j will be a random index. The NSP algorithm was used to pre-train BERT, and we adopted the NSP algorithm to examine its combination with the MLM or RFP algorithms. For more details of the NSP algorithm, please refer to Jacob Devlin et al. [3].
Same-sequence prediction.
Some previous studies have reported that the NSP algorithm might not be beneficial for downstream task performance because it is semantically shallow and can often be solved easily through lexical overlap [27,28]. We also believe that the NSP algorithm is too simplistic for byte sequences, so we propose an alternative algorithm, namely Same-Sequence Prediction (SSP). Given a sequence pair (si, sj), the SSP algorithm assumes that si and sj are sub-sequences sampled from an original sequence. With a probability pssp = 0.5, sj is sampled from the same original sequence as si; otherwise, sj is sampled from a different sequence. This approach requires the language model (LM) to predict whether the two sub-sequences originate from the same original sequence, making it a sequence-level binary classification task. Specifically, as shown in Fig 3, the output representation of the <CLS> token is used to generate the sequence-level prediction. Even if the sub-sequences are from the same sequence, they might be quite far apart, making this task more challenging than the NSP algorithm. We believe that the SSP algorithm encourages the LM to understand patterns at the original sequence level, which may improve task performance.
Combinations.
Based on the finding that combining MLM with other PT algorithms contributed to task performance improvement [27], we investigate combinations of the PT algorithms. The above-mentioned four PT algorithms can be divided into two groups: token-level and sequence-level. The MLM and RFP algorithms are token-level, while the remaining two algorithms are sequence-level. We examine four combinations: MLM+NSP, MLM+SSP, RFP+NSP, and RFP+SSP. Fig 3 depicts an overview of the combination of our two newly proposed algorithms: Real-or-Fake Prediction (RFP) and Same-Sequence Prediction (SSP). Algorithm 1 describes the step-by-step process of data preparation for the combination of the RFP and SSP algorithms. ByteEnc takes a concatenation of a pair of two byte sequences si and sj (see line 2 in Algorithm 1). From the resulting vector of the first token (i.e., <CLS>), we obtain a binary classification result according to the SSP algorithm. Line 3 in Algorithm 1 generates the label for this binary classification. At the same time, the sequence pair may contain noise bytes (fake bytes), which are selected using the probability prfp1 (see lines 4∼10). For each noise byte, we obtain its resulting vector through a forward pass and make a real-or-fake prediction according to the RFP algorithm.
Algorithm 1 Data preparation for the RFP+SSP combination.
Require: a set S of sub-sequences, two probabilities of the RFP algorithm prfp1 and prfp2
1: randomly pick i and j from {1, ..., |S|}
2: X ← a pair of two sequences (<CLS>, si, <SEP>, sj, <SEP>)
3: ySSP = 1 if si and sj came from the same original stream, otherwise 0
4: xRFP = [0,0,...,0] where |xRFP| = |X|
5: yRFP = X
6: for k = 0 to |X| − 1 do
7:   if Xk not in {<CLS>, <SEP>} and Bernoulli(prfp1) = 1 then
8:     yRFP,k = randomly chosen byte with a probability prfp2
9:   end if
10: end for
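The following Python sketch mirrors Algorithm 1 in a form consistent with the RFP description above: candidate bytes are selected with prfp1, roughly half of them are replaced with fake bytes using prfp2, and the per-position fake labels and the SSP label are returned. The special-token ids and function name are assumptions, not the authors' exact code.

# Sketch of RFP+SSP data preparation (Algorithm 1); names and special-token ids are assumed.
import random

CLS, SEP = 256, 257    # hypothetical special-token ids beyond the 256 byte values

def prepare_rfp_ssp(s_i, s_j, same_stream: bool, p_rfp1: float = 0.15, p_rfp2: float = 0.5):
    tokens = [CLS] + list(s_i) + [SEP] + list(s_j) + [SEP]   # line 2: concatenated pair
    y_ssp = 1 if same_stream else 0                          # line 3: SSP label
    candidate = [0] * len(tokens)    # positions where a real-or-fake prediction is required
    is_fake = [0] * len(tokens)      # RFP label: 1 if the byte was replaced, 0 otherwise
    for k, tok in enumerate(tokens):                         # lines 6-10
        if tok in (CLS, SEP):
            continue
        if random.random() < p_rfp1:                         # select candidate bytes
            candidate[k] = 1
            if random.random() < p_rfp2:                     # replace ~half with fake bytes
                tokens[k] = random.randrange(256)
                is_fake[k] = 1
    return tokens, candidate, is_fake, y_ssp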
Results and discussion
Settings
We try many settings, including different positional embeddings, n-grams, and combinations of pre-training strategies. Specifically, NoPE and RePE are considered for the positional embeddings, while unigram and bigram are evaluated for n-grams. For the combinations of pre-training algorithms, we try the four possible combinations (i.e., MLM+NSP, MLM+SSP, RFP+NSP, and RFP+SSP) and every single algorithm (i.e., RFP, NSP, SSP, and MLM). As it is not trivial to pre-train large language models for all different settings, we chose a small language model, BERT-small, having 4 encoder layers with a hidden dimension of 512 and 8 attention heads. To check scalability, we also try a larger model, BERT-base, and its results are described at the end of this section.
We employed the AdamW optimizer [29] and the cross-entropy loss for both pre-training and fine-tuning. The hyperparameters for pre-training are as follows: initial learning rate lr = 5e−05, ε = 1e−08, weight decay wd = 0.001, mini-batch size bs = 128, and warmup steps ws = 1000. The number of epochs was 100 and 10 for the pre-training and fine-tuning, respectively. The performance metrics are precision, recall, and F1 score. For all experimental results, we conducted three independent runs and computed average performance. The specification of our machine is summarized in Table 2.
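For clarity, a minimal setup matching these hyperparameters might look as follows; the linear warmup-then-decay schedule and the placeholder model and step count are assumptions, not details stated in the paper.

# Optimizer setup matching the reported hyperparameters (AdamW, lr=5e-05, eps=1e-08,
# weight decay=0.001, 1000 warmup steps); `model` and `total_steps` are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model: torch.nn.Module, total_steps: int):
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-05,
                                  eps=1e-08, weight_decay=0.001)
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=1000,
                                                num_training_steps=total_steps)
    return optimizer, scheduler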
Positional encoding and n-grams
Before comparing different pre-training algorithms for malware detection, we first conducted experiments to find the best settings of positional encoding and n-grams. As described in the Materials and methods section, we compare the NoPE and RePE in terms of malware detection performance and also evaluate the byte-level n-gram tokenization. The averaged results are summarized in Table 3.
Between the NoPE and RePE, when n is fixed, the RePE exhibited better performance in all metrics. This indicates that the relative positional embeddings convey informative representations for malware detection. Although the NoPE worked better than other positional encodings in the work of Amirhossein Kazemnejad et al. [25], that comparison was between decoder-based LMs. As the encoder-based LM does not follow the auto-regressive objective, the relative positional embeddings play an important role in allowing the LM to learn sequential and positional patterns from the byte sequences.
We also tried different n-grams as shown in Table 3, and found that the unigram (i.e., n = 1) gives better performance. The byte-level n-grams are essentially similar to the character-level n-grams in the NLP field. Character-level (or sub-character-level) n-grams have shown quite successful results for some languages (e.g., Chinese, Japanese) [30–32], which share a common characteristic: they do not use white space between characters to separate words, so there is no explicit word boundary. Likewise, the sub-sequences of bytes are sampled without considering operation boundaries. The byte-level unigram seems to allow the model to avoid violating the operation boundaries and does not harm the semantic patterns within the sequence, because the operation codes of programming languages are less complex than natural language sentences.
To summarize, we found that the best settings of the positional encoding and byte-level n-grams are the RePE and unigram tokenization. Besides this, the most fascinating finding is that ByteEnc is pre-trained only with HWP sub-sequences but works well for malware detection on other formats (e.g., PDF and MS Office). This is consistent with the work of Young-Seob Jeong et al. [20], which utilized byte sequences of different file formats to improve malware detection performance. On the other hand, we observed worse performance for the ‘malware’ class in the HWP format compared to other formats. This may be due to the fact that HWP byte streams are significantly longer than those of other formats, and the imbalanced precision and recall result from the highly skewed distribution of HWP files, as shown in Table 1.
Pre-training algorithms
With the best RePE and unigram tokenization settings, we performed intensive experiments to investigate possible combinations of pre-training algorithms. Based on a grid search, we set prfp1 = 0.15 and prfp2 = 0.5 for the RFP algorithm. The per-format performance for the ‘normal’ and ‘malware’ classes is depicted in Figs 4, 5, 6, 7, 8, and 9. There are several key observations. First, among the token-level algorithms (i.e., MLM and RFP), our proposed RFP generally shows superior performance; for example, for the ‘normal’ class in the PDF format, the RFP algorithm showed an approximately 0.15%p higher F1 score than the MLM algorithm. Likewise, among the sequence-level algorithms, our SSP outperforms the NSP algorithm. Second, compared to the token-level algorithms, the sequence-level algorithms (i.e., NSP and SSP) give the worst performance when they are adopted alone without combinations. This is reasonable, as many previous studies found that the NSP algorithm may not contribute much to downstream performance. Third, the best combination is MLM+SSP. This can be explained by the SSP algorithm encouraging the model to incorporate intra-sequence information, so that the model is able to deal with malicious operations scattered over a sequence. This eventually allows the model to predict the original bytes more accurately in the MLM algorithm.
Comparison to CNNs and a bigger language model
Since previous models for malware detection using byte sequences have primarily been convolutional neural networks (CNNs), we compared our best model (i.e., ByteEnc pre-trained using MLM+SSP) with two CNN-based models: MalConv [10] and SPAP [21]. Both CNNs were trained using cross-entropy loss for 10 epochs with a dropout rate of 0.5 and a batch size of 10, and we followed the default settings for other hyperparameters. Like ByteEnc, the input length for the CNNs was set to 512. The ByteEnc model used in our experiments had the same size as BERT-small (29M parameters). Additionally, we tested a larger ByteEnc model with the same size as BERT-base (110M parameters), featuring 12 encoder layers. Table 4 summarizes the results of the CNNs and the ByteEnc models of different sizes. Compared to the CNNs, the ByteEnc models consistently demonstrated significantly better performance across all file formats, as reflected in the macro F1 scores. This result suggests that ByteEnc effectively transfers the knowledge of byte sequences learned during pre-training to the malware detection task. One potential concern is the relatively lower performance for the HWP format. However, as mentioned earlier, HWP files tend to have significantly longer sequences than other formats, making malware detection more challenging since malicious actions may be spread across these long sequences. In future work, we plan to extend our approach to support longer sequence lengths. Another noteworthy observation is that ByteEnc (BERT-base) generally outperforms ByteEnc (BERT-small). Although ByteEnc (BERT-base) shows slightly lower performance in the PDF format, it achieves greater performance gains in the other two formats. Notably, both ByteEnc (BERT-base) and ByteEnc (BERT-small) were pre-trained on the same dataset, suggesting that a larger language model is better at learning and capturing the sequential patterns underlying byte sequences, ultimately leading to improved performance. This is also supported by Fig 10, which shows that the training loss of ByteEnc (BERT-base) decreased more than that of ByteEnc (BERT-small).
Conclusions
In this paper, we introduced a new language model that is pre-trained with only byte sequences, with the aim of protecting Web users from the threat of malicious attacks. The language model, ByteEnc, is pre-trained with the byte sequences of HWP documents, but we demonstrated that it works well for malware detection on other formats such as PDF and MS Office. We also designed new pre-training algorithms and investigated combinations of the pre-training algorithms. We conducted extensive experiments and demonstrated that ByteEnc significantly outperformed previous convolutional neural networks (CNNs) with a large performance gap. We believe our language model will serve as a foundation model for malware detection services for non-executables.
Although we experimented with ByteEnc in two sizes (i.e., BERT-small and BERT-base), there is still ample room for performance improvement with much larger models. In this paper, we set the input length to 512, which may not be sufficient to capture malicious actions across the byte sequences, potentially resulting in overall performance degradation in practical applications. In particular, despite the high precision for the normal class, there remains a risk that deployment in real-world security services could lead to significant financial and societal costs because of the low sensitivity. We believe that the fundamental solution to this limitation is to keep collecting a more balanced dataset, so that the models learn more about the malicious cases.
In future work, we plan to extend our research to develop a more powerful encoder-based model that can process longer byte sequences. Furthermore, we will continue collecting and annotating balanced datasets, and we expect that pre-training on larger datasets with more diverse formats will make a significant contribution to advancements in the field of malware detection.
References
- 1. Jeong Y-S, Mswahili ME, Kang AR. File-level malware detection using byte streams. Sci Rep. 2023;13(1):8925. pmid:37264210
- 2. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN. Attention is all you need. In: Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA. 2017. p. 5998–6008.
- 3. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint 2018. http://arxiv.org/abs/1810.04805
- 4. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 2018.
- 5. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. 2019.
- 6. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P. Language models are few-shot learners. 2020.
- 7. Zhang P, Zhao H, Shao Z, Xie X, Hu H, Zeng Y, et al. Enhanced multi-scenario running safety assessment of railway bridges based on graph neural networks with self-evolutionary capability. Engineering Structures. 2024;319:118785.
- 8. Zhang P, Zhao H, Shao Z, Xie X, Hu H, Zeng Y, et al. A novel graph neural network framework with self-evolutionary mechanism: application to train-bridge coupled systems. Advances in Engineering Software. 2024;197:103751.
- 9. Zhang X, Xie X, Tang S, Zhao H, Shi X, Wang L, et al. High-speed railway seismic response prediction using CNN-LSTM hybrid neural network. J Civil Struct Health Monit. 2024;14(5):1125–39.
- 10. Raff E, Barker J, Sylvester J, Brandon R, Catanzaro B, Nicholas C. Malware detection by eating a whole EXE. In: Proceedings of the Workshops of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, USA. 2018. p. 268–76.
- 11. Jeong Y-S, Woo J, Kang AR. Malware detection on byte streams of PDF files using convolutional neural networks. Security and Communication Networks. 2019;2019:1–9.
- 12. Şahin N. Malware detection using transformers-based model GPT-2. 2021.
- 13. Demirci D, Sahin N, Sirlancis M, Acarturk C. Static malware detection using stacked BiLSTM and GPT-2. IEEE Access. 2022;10:58488–502.
- 14. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8):1735–80.
- 15. Rahali A, Akhloufi MA. MalBERT: using transformers for cybersecurity and malicious software detection. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Melbourne, Australia. 2021. p. 3226–31.
- 16. Demirkıran F, Çayır A, Ünal U, Dağ H. An ensemble of pre-trained transformer models for imbalanced multiclass malware classification. Computers & Security. 2022;121:102846.
- 17. Clark JH, Garrette D, Turc I, Wieting J. Canine: pre-training an efficient tokenization-free encoder for language representation. Transactions of the Association for Computational Linguistics. 2022;10:73–91.
- 18. Rahali A, Akhloufi MA. MalBERTv2: code aware BERT-based model for malware identification. Big Data and Cognitive Computing. 2023;7(2):1–33.
- 19. Jeong YS, Woo J, Kang AR. Malware detection on byte streams of hangul word processor files. Applied Sciences. 2019;9(23):1–13.
- 20. Jeong Y-S, Lee S-M, Kim J-H, Woo J, Kang AR. Malware detection using byte streams of different file formats. IEEE Access. 2022;10:51041–7.
- 21. Jeong YS, Woo J, Lee S, Kang AR. Malware detection of hangul word processor files using spatial pyramid average pooling. Sensors. 2020;20(18):1–12.
- 22. Shaw P, Uszkoreit J, Vaswani A. Self-attention with relative position representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana. 2018. p. 464–8.
- 23. Press O, Smith NA, Lewis M. Train short, test long: attention with linear biases enables input length extrapolation. In: Proceedings of the 10th International Conference on Learning Representations, virtual event. 2022.
- 24. Su J, Ahmed M, Lu Y, Pan S, Bo W, Liu Y. RoFormer: enhanced transformer with rotary position embedding. Neurocomputing. 2024;568.
- 25. Kazemnejad A, Padhi I, Ramamurthy KN, Das P, Reddy S. The impact of positional encoding on length generalization in transformers. In: Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems, New Orleans, USA. 2023.
- 26. Clark K, Luong MT, Le QV, Manning CD. ELECTRA: pre-training text encoders as discriminators rather than generators. In: Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia. 2020.
- 27. Aroca-Ouellette S, Rudzicz F. On losses for modern language models. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. p. 4970–81. https://doi.org/10.18653/v1/2020.emnlp-main.403
- 28. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D. RoBERTa: a robustly optimized BERT pretraining approach. 2019. https://arxiv.org/abs/1907.11692
- 29. Loshchilov I, Hutter F. Decoupled weight decay regularization. In: Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA. 2019. p. 1–19.
- 30. Chung T, Xu B, Liu Y, Ouyang C, Li S, Luo L. Empirical study on character level neural network classifier for Chinese text. Engineering Applications of Artificial Intelligence. 2019;80:1–7.
- 31. Aoki T, Kitada S, Iyatomi H. Text classification through glyph-aware disentangled character embedding and semantic sub-character augmentation. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Suzhou, China. 2020. p. 1–7.
- 32. Si C, Zhang Z, Chen Y, Qi F, Wang X, Liu Z, et al. Sub-Character Tokenization for Chinese pretrained language models. Transactions of the Association for Computational Linguistics. 2023;11:469–87.