Abstract
Software defect prediction is a technology that uses known software information to predict defects in target software. Generally, models are built using features such as software metrics, semantic information, and software networks. However, due to complex software structures and small sample sizes, software features cannot be fully utilized without effective feature representation and extraction methods, which easily leads to misjudgments and reduced performance. In addition, a single feature cannot fully characterize the software structure. Therefore, this research proposes a new method to efficiently and accurately represent the Abstract Syntax Tree (AST) and a model called MFA (Multi Features Attention) that uses a deformable attention mechanism to extract features and a self-attention mechanism to fuse semantic and network features. Cross-version and cross-project experiments on 21 Java projects against multiple models show that the average ACC, F1, and AUC of the proposed model reach 0.7, 0.614, and 0.711 in the cross-version scheme and 0.687, 0.575, and 0.696 in the cross-project scheme, up to 41% better than other models. The results of fused features are better than those of any single feature, showing that MFA, which extracts and fuses two features, has greater advantages in prediction performance.
Citation: Qiu S, E B, He J (2025) Features extraction and fusion by attention mechanism for software defect prediction. PLoS ONE 20(4): e0320808. https://doi.org/10.1371/journal.pone.0320808
Editor: Dola Sundeep, IIIT Kurnool: Indian Institute of Information Technology Design and Manufacturing Kurnool, INDIA
Received: January 15, 2025; Accepted: February 24, 2025; Published: April 14, 2025
Copyright: © 2025 Qiu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All code and dataset files are available from the github database (https://github.com/ebcwind/MFA/).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Software defect prediction generally uses three types of features: software metrics, semantic features, and network features. Software metrics mainly refer to hand-crafted features, such as the McCabe and CK metrics, which measure software structure. In this case a file often has a fixed value, so such features have low robustness and poor model transferability in experimental scenarios such as cross-version and cross-project prediction. Semantic features and network features often perform better, but generally require corresponding deep learning techniques for processing [1].
Among semantic feature sources such as source code and the Control Flow Graph (CFG), the Abstract Syntax Tree (AST) is widely used [2]. Because parsing acts as a pre-check step for compilers, it can eliminate syntax problems in the source code and increase the accuracy of semantic features. The common method for processing semantic features with an AST is to traverse it, then either perform word embedding on the traversal sequence or represent it as an integer sequence, which is input into the model for learning. However, the tree structure of the AST cannot be effectively represented by a single sequence before deep learning, which easily causes semantic feature loss.
Some studies [3,4] introduce the CFG to supplement the structure missing from AST traversal sequences, or treat the AST as a graph. They generally use graph embedding or GNN-like neural networks to process the graph structure, which not only increases the complexity of feature representation but also involves a considerable number of hyperparameters and a large workload. Other studies tend to optimize the AST structure to reduce its complexity, commonly by quantifying or decomposing it. Munir et al. [5] designed 32 types of statement-level metrics to measure the AST structure. Jiang et al. [6] pruned the AST to reduce its depth. These methods improve the model's processing capability and learning speed, but they also compromise the integrity of the AST structure, resulting in bias.
Semantic features are only one aspect of the software. They primarily define the software’s functionality and determine the usability of a particular module. Software is an interconnection and combination of multiple modules; thus, it is also necessary to consider the relationships between the modules to achieve the ideal goal of high cohesion and low coupling. The software network, as a supplement to AST, is receiving increasing attention from researchers.
Therefore, the proposed method fully explores the structural information of the AST by combining three different traversal methods to obtain its spatial information, which yields more effective features than the traditional single traversal sequence. After these sequences are embedded, they are input as three channels into a Transformer with a deformable attention mechanism. The embedded network features and the extracted semantic features are then weighted by the self-attention mechanism, and a fully connected network performs binary classification.
In cross-version and cross-project scenarios, batch oversampling is used to address imbalanced datasets and improve prediction performance. Experiments are conducted on 21 open source projects. The results show that the method outperforms state-of-the-art methods in most cases. The main contributions of this paper are as follows.
- Proposing triple traversal sequences for AST representation, which convert the spatial information of the AST into token sequences and fully represent its features.
- The Transformer with a deformable attention mechanism is used to process the word embedding matrix to discover defect patterns of different scales.
- The weights of semantic features and network features are determined through the self-attention mechanism to predict defects in files.
Related work
Software defect prediction
Software defects refer to statements, modules, or structures within the software that cause the software's functions to operate abnormally, affecting its performance or efficiency. Defect prediction can help engineers assess the software's condition during the early stages of the software lifecycle, thereby avoiding increases in cost and workload. However, because defects are relatively rare compared to normal samples, the sample classes are extremely imbalanced, and some software modules have few, scattered defect patterns, which increases the difficulty of prediction. Moreover, cross-version and cross-project predictions are often more challenging, as different coding habits and software purposes pose greater challenges to the accuracy of defect prediction.
The software defect prediction model consists of several components, including feature acquisition, training, prediction, and evaluation. Early work mainly focused on feature selection and search algorithms, during which software metrics were used as input and traditional machine learning served as the prediction model. It was not until the rise of deep learning that research began to favor feature extraction techniques for processing source code. At the same time, the learning models transitioned from machine learning to deep learning, such as convolutional neural networks (CNN), long short-term memory networks (LSTM), and Transformers. Several studies using deep learning have indicated that their performance, after evaluation, surpasses that of traditional machine learning [7].
Multiple features prediction
In recent years, software defect prediction methods have shifted from single features to multiple features, and feature types have also changed from traditional metrics to semantic features and network features.
Shi et al. [8] extracted features from the AST and the software network. Similar to most studies, they used only three types of AST nodes, with semantic features extracted by a CNN and structural features extracted by skip-gram; the two features were directly concatenated and input into a classifier. Cross-version experiments on six Java datasets and four models showed that the hybrid features achieved a higher F1 score than traditional hand-crafted features or single semantic or structural features, with the average F1 reaching 0.560.
Wang et al. [9] proposed a method based on a gated hierarchical long short-term memory network (gh-LSTM). This method uses LSTM to extract features from word embeddings of the AST and from traditional metrics; the two features are then connected through a gated fusion layer for defect prediction. The results show that the average F1 is 0.612, higher than that of other methods. Tao et al. [10] used BiLSTM to extract semantic information from sequences of AST tokens and code change tokens. They also connected semantic features with traditional metrics and used a gated fusion mechanism to determine the combined ratio of the two features. Cross-version results showed that the average F1 and MCC of the method reached 0.633 and 0.399. Their work indicates that multi-feature fusion has certain advantages.
In contrast, Cui et al. [11] used metrics as network node attributes, established a complex software network, and used a community detection algorithm to divide the graph into multiple subgraphs. They learned node representation vectors through an improved graph neural network model and finally used them to classify software defects. However, they only made predictions within projects, and the performance improvement was small compared to other models, indicating that the combination of network features and software metrics does not have significant advantages.
Yu et al. [12] built a deep learning model for defect prediction based on the self-attention mechanism (DPSAM) to automatically extract semantic features and perform defect prediction. They parsed the source code, embedded the AST as integer sequences, and learned them through multi-head attention. Zhao et al. [13] combined the AST and CFG via dataflow. They designed AGN4D (attention-based graph neural network for directed graphs) to extract contextual features from the directed graph. The contextual and local features are added together, then pooled and predicted by the classifier. Compared to many models, their average F1 (0.54) and AUC (0.675) improved significantly. Both articles show that semantic features perform well within and across projects.
These multi-feature software defect prediction studies have shown that SDP performance is affected to a certain extent by the type of features, the feature extraction technology, and the fusion method. Apart from the relatively small improvement from combining network features and metrics, the combinations of other features have certain advantages. However, many studies used few types of AST nodes and represented and extracted the AST inefficiently, and feature fusion was merely a simple addition or concatenation, leading to a loss of information in the feature space. The proposed method therefore introduces more node types and attributes, and a more efficient representation method has been designed. A deformable attention mechanism is employed for rapid extraction of complex features, while spatial information is preserved during feature fusion. Consequently, the method increases feature robustness and improves model prediction performance.
Methodology
Our MFA method consists of three main stages: source code preprocessing, feature processing, and feature fusion and prediction. In the first stage, the AST and CDN are constructed. In the second stage, the GloVe and ProNE methods are used to embed the parsed AST and the CDN, respectively. Semantic features are extracted through ResNet50 and the deformable attention mechanism, and the dimension of the network features is expanded to match the semantic features. Finally, the self-attention mechanism is used to fuse the features for defect prediction. The MFA framework is shown in Fig 1.
Pre-processing
Pre-processing includes parsing the source code and representing its features. For semantic features, the source code is parsed into an AST, which is then traversed in three ways. For network features, the CDN is built and represented by node pairs.
Triple traversal sequences of AST.
The AST is generated by parsing the source code. In many studies [8–10,12], a single traversal is generally used to represent the AST. However, different software structures can yield identical single traversal sequences, reducing the model's resolution. As shown in Fig 2, the left code is a correct program, while the right may fall into an infinite loop, but their single traversal sequences are the same because the "i++" fragment is at the end of the while loop. The ASTs of the programs are shown below the code.
Therefore, we propose three AST traversal methods, namely root-first, leaf-first, and level-order traversal. Root-first is depth-first traversal: starting from the root, it walks down to the leaf node of a path, then returns to the previous fork and walks down to the leaf node of the next path, and so on, until all nodes are visited. Leaf-first visits leaf nodes before their parents: a root node is never visited until all of its leaf nodes have been visited. Level-order traversal is breadth-first traversal, which visits nodes according to the hierarchical structure of the tree. The latter two traversal methods provide more detail than a single traversal, and combined they can represent a unique program structure.
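The three traversal orders can be sketched in a few lines of plain Python. This is an illustrative sketch, not the paper's implementation: the AST is represented here as nested dictionaries with hypothetical `type` and `children` keys, whereas the actual pipeline traverses javalang's node objects.

```python
from collections import deque

def root_first(node):
    """Depth-first (pre-order) traversal: root before its subtrees."""
    order = [node["type"]]
    for child in node.get("children", []):
        order.extend(root_first(child))
    return order

def leaf_first(node):
    """Post-order traversal: a node is visited only after all its children."""
    order = []
    for child in node.get("children", []):
        order.extend(leaf_first(child))
    order.append(node["type"])
    return order

def level_order(node):
    """Breadth-first traversal: nodes visited level by level."""
    order, queue = [], deque([node])
    while queue:
        n = queue.popleft()
        order.append(n["type"])
        queue.extend(n.get("children", []))
    return order

# Tiny hand-built AST resembling `while (i < n) { foo(); i++; }`
ast = {"type": "WhileStatement", "children": [
    {"type": "BinaryOperation", "children": []},
    {"type": "BlockStatement", "children": [
        {"type": "StatementExpression", "children": []},
        {"type": "StatementExpression", "children": []},
    ]},
]}
```

The three sequences differ in where `WhileStatement` and the leaf nodes land, which is exactly the positional information the triple representation exploits.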
The results of the three traversals of the program in Fig 2 are shown in Table 1. It can be seen from the table that the position of "StatementExpression" changes in the leaf-first and level-order sequences, indicating that they can distinguish the AST structures shown in Fig 2. Each traversal still has its own focus: leaf-first gathers all leaf nodes together, similar to a clustering operation, while level-order better captures the hierarchical structure of the tree.
The traversal tokens include not only the node types defined by the AST but also the attributes under the nodes, such as method types and modifiers. Elements that cannot transfer across learning scenarios, such as method names and package names, are removed, since they would burden defect identification in cross-version or cross-project scenarios.
Class dependency network.
Java is an object-oriented programming language, and software written in it is based on classes. There are relationships between classes, such as references, object creation, inheritance, and interface implementation, which relate to the software's structure and function. Therefore, a class can be regarded as a node. As shown in Fig 3, class A calls class B, class B inherits C and implements interface D, and C not only inherits E but also references A. The complex relationships between them can be converted into a corresponding graph, with the dependencies between classes converted into directed edges. The directly dependent classes are extracted from each file, and each class is denoted by a unique integer. The edges between classes are then recorded as integer pairs in preparation for network embedding.
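The integer-pair encoding described above can be sketched as follows. This is a minimal illustration assuming the class dependency pairs have already been extracted; the function name `build_cdn_edges` is ours, not from the paper's code.

```python
def build_cdn_edges(dependencies):
    """Map class names to unique integers and encode dependencies
    as directed integer pairs (source -> target)."""
    ids = {}
    def idx(name):
        if name not in ids:
            ids[name] = len(ids)  # assign the next unused integer
        return ids[name]
    edges = [(idx(src), idx(dst)) for src, dst in dependencies]
    return ids, edges

# Relationships from the Fig 3 example: A calls B, B inherits C,
# B implements D, C inherits E, and C references A.
deps = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("C", "A")]
ids, edges = build_cdn_edges(deps)
```

The resulting `edges` list is exactly the node-pair form that a network embedding method such as ProNE consumes.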
Features processing
Feature processing includes two parts: feature embedding and feature extraction. Among them, semantic feature extraction is particularly important. Since the pattern of the triple traversal sequences is complex, dedicated deep learning models are required to extract defect features from them.
Features embedding.
For semantic features, Global Vectors for Word Representation (GloVe) [14], an unsupervised learning algorithm, is used to obtain vector representations of words. Training is based on the global word-word co-occurrence matrix of the corpus, and the generated vectors exhibit a linear substructure of the word space. The embedding principle is shown in Eq 1 and Eq 2.

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \quad (1)

f(x) = \begin{cases} (x / x_{\max})^{\alpha}, & x < x_{\max} \\ 1, & \text{otherwise} \end{cases} \quad (2)

where V is the size of the vocabulary, f(x) is the weight function, and X_{ij} represents the number of times word j appears in the context of word i. w_i and \tilde{w}_j both represent word vectors, and b_i and \tilde{b}_j are biases. \alpha and x_{\max} are adjustable parameters of the weight function; \alpha is generally 0.75, and since x_{\max} has little impact on the performance of the model, it is fixed to 100 here.
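The GloVe weight function with the parameter values used here (α = 0.75, x_max = 100) is a one-liner; the sketch below only illustrates the weighting behavior, not the full training loop.

```python
def glove_weight(x, x_max=100, alpha=0.75):
    """GloVe weighting f(x): rare co-occurrences get small weights,
    and pairs at or above x_max are capped at weight 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```

For example, a pair seen once gets weight (1/100)^0.75 ≈ 0.032, while any pair seen 100 or more times gets the full weight 1, so very frequent token pairs do not dominate the loss.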
As an unsupervised word embedding model, GloVe requires more samples and data than supervised models. Therefore, in this method, the three traversal sequences of each file in the selected 21 projects are input into the word embedding model. The datasets include different projects and different versions.
For network features, the ProNE network embedding method is adopted. Network embedding is a technique that expresses network nodes or sub-networks as vectors so that models can use them directly. Among existing network embedding techniques, ProNE generally has better performance [15].
ProNE [16] formulates network embedding as sparse matrix factorization to efficiently compute initial node representations. It then uses higher-order Cheeger inequalities to modulate the spectral space of the network and propagates the learned embeddings in the modulated network, integrating local smoothing and global clustering information.
Feature extraction.
Due to the complex and redundant structure of semantic features, defects are only weakly associated with them, and defect structures are often scattered and irregular. Deep learning technologies such as CNN and LSTM, which are good at learning spatial and temporal regularities, therefore suffer from slow convergence and inaccuracy, and the extremely imbalanced software samples further increase the difficulty of defect prediction. For these reasons, the deformable attention mechanism [17] is used to extract semantic features. This method relies on the Transformer and its attention mechanism.
The attention mechanism can effectively capture the relationship between distant elements in a sequence without being restricted by the sequence length. This is why it is favored by researchers in this field, as the lengths of the AST vary, and the defect patterns require a comprehensive judgment of long-distance phrases.
Attention is calculated from three vectors, named Query, Key, and Value. The Query at each position takes a dot product with the Keys at all positions to obtain attention scores. These scores are then used to weight the Values to generate the output for each position. This global calculation allows each position to connect directly to every other position. The problem with the standard attention mechanism, however, is that it requires a large number of samples to converge. This is fatal for software with small sample sizes, and deformable attention solves this problem.
In the encoder stage, the value of each point in the feature vector is determined by a specific number of surrounding points, and the positions of these points are also learned. By assigning a small fixed number of keys to each query, the problems of convergence and feature space resolution can be alleviated.
Given an input feature map x, let q index a query with query feature z_q and reference point p_q. The deformable attention feature is computed as

\mathrm{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x(p_q + \Delta p_{mqk}) \right] \quad (3)

where m, k, and K denote the attention head, the sampled key, and the total number of sampled keys, respectively, and M is the number of attention heads. W'_m is the projection matrix of the input value, and W_m is the output projection matrix. \Delta p_{mqk} and A_{mqk} represent the offset and attention weight of the kth sampling point of the mth attention head, respectively; both are linear projections obtained from the query feature z_q. In this process, the query feature z_q is linearly mapped, the sampling point offset \Delta p_{mqk} is expressed as floating point numbers in the x and y directions, and the attention weight A_{mqk} is obtained from z_q after softmax. The principle is shown in Fig 4.
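As a rough illustration of this sampling-and-weighting scheme, the NumPy sketch below computes single-head deformable attention for one query over a 1-D feature map. It is a simplified sketch, not the model's implementation: nearest-neighbour sampling stands in for bilinear interpolation, and the offsets and softmaxed weights are supplied directly rather than projected from the query feature.

```python
import numpy as np

def deformable_attention_1d(feat, p_q, offsets, weights, W_v, W_o):
    """Single-head deformable attention for one query (sketch).
    feat: feature map of shape (L, d); p_q: reference position;
    offsets: K sampling offsets (in the model, projected from z_q);
    weights: K softmax-normalised attention weights (likewise)."""
    L, d = feat.shape
    sampled = []
    for dp in offsets:
        # Nearest-neighbour sampling, clipped to the feature map bounds
        pos = int(np.clip(round(p_q + dp), 0, L - 1))
        sampled.append(feat[pos] @ W_v)        # value projection W'_m
    sampled = np.stack(sampled)                # (K, d)
    return (weights @ sampled) @ W_o           # weighted sum, output projection

rng = np.random.default_rng(0)
d, L, K = 4, 10, 3
feat = rng.standard_normal((L, d))
w = np.exp(rng.standard_normal(K)); w /= w.sum()  # softmax-normalised weights
out = deformable_attention_1d(feat, p_q=5.0,
                              offsets=[-1.2, 0.0, 2.4],
                              weights=w,
                              W_v=np.eye(d), W_o=np.eye(d))
```

Each query attends to only K sampled points instead of all L positions, which is why convergence needs far fewer samples than full attention.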
In the encoder stage, the calculation speed is related to the size of the feature space and the number of attention points, while in the decoder stage the complexity is independent of the space, so considerable computation time is saved. After the semantic features pass through ResNet50, the outputs of the first, second, and third layers are used as the input of the Transformer, and attention is applied between the features of these different-scale layers to enhance the features. At the same time, the outputs of the three decoders are input into the self-attention mechanism to participate in the weight allocation of the fused features.
For the embedded network features, a CNN is used to expand their channels to match the dimensions of the semantic features, enhancing their competitiveness before they are input into the self-attention mechanism for feature fusion.
Features fusion and prediction
After processing, the semantic features and network features have different emphases, and their degrees of association with defects also differ. Therefore, when features are fused, it is necessary to determine the weight distribution of the different features without destroying the feature space structure.
To solve this problem, the channel attention in the Convolutional Block Attention Module (CBAM) [18] is introduced, as shown in Fig 5. CBAM determines the weight of each channel through channel convolution, pooling, and multilayer perceptron training, then assigns weights to different channels. When the weight is large, the value of the channel feature map increases accordingly; conversely, the value of the channel feature map becomes smaller, inhibiting its activation. Eq 4 shows the change of the feature after passing through the CBAM channel attention module:

F' = M_c(F) \otimes F, \quad M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) \quad (4)

where \sigma is the sigmoid function and \otimes denotes channel-wise multiplication.
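CBAM's channel attention can be sketched in NumPy as below. This is an illustrative sketch of the standard CBAM channel-attention formula, with random weight matrices standing in for the trained shared MLP; `channel_attention` and the reduction ratio of 2 are our choices, not the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """CBAM channel attention (sketch): a shared two-layer MLP over the
    average- and max-pooled channel descriptors, sigmoid-gated, then
    used to rescale each channel. F has shape (C, H, W)."""
    avg = F.mean(axis=(1, 2))                       # (C,) average-pooled
    mx = F.max(axis=(1, 2))                         # (C,) max-pooled
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)    # shared MLP with ReLU
    M_c = sigmoid(mlp(avg) + mlp(mx))               # per-channel weights in (0, 1)
    return F * M_c[:, None, None]                   # scale each channel map

rng = np.random.default_rng(1)
C, H, W = 8, 4, 4
F = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // 2, C))  # reduction ratio r = 2 (illustrative)
W2 = rng.standard_normal((C, C // 2))
out = channel_attention(F, W1, W2)
```

Because the gate is a sigmoid, every channel is scaled by a factor in (0, 1): strongly weighted channels are preserved almost unchanged, while weakly weighted ones are suppressed.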
Experiments
Datasets
21 Java datasets from the PROMISE database are selected for the experiments. According to statistics, Java-written software is the most widely used in defect prediction [1,2], providing more options for comparative experiments. ASTs are generated by javalang, a commonly used Python package for parsing source code, and a tool named Class Dependency Analyzer is used to extract class dependencies and construct the CDN. Matching the source code against the dataset file names, most of the source code is parsed successfully, but some files fail to parse due to missing files or syntax errors. Therefore, the experimental dataset is a simplified version of the original dataset, retaining all correct and locatable files. The details of the datasets are shown in Table 2.
Settings
This model is programmed in the environment of Python 3.9, PyTorch 2.0.1 and Cuda 11.8. The experiment is carried out in two scenarios. In the cross-version scenario, the training set and the test set come from different versions of the same software. The training set is the version with a smaller version number, while the test set is the version with a larger version number. The old version can be used to predict defects in the new version. For example, ant1.4 is used to predict defects in ant1.6, and ant1.6 is used to predict defects in ant1.7.
In the cross-project scenario, all versions of a project are used as training sets in the experiment, and all versions of other projects are used as test sets for prediction. For example, ant1.4, ant1.6, and ant1.7 are merged into one training set, camel, jedit, lucene, poi, velocity, and xalan are used as test sets. The integration of multiple versions and projects is conducive to enriching the types of code structures.
Since the dataset for software defect prediction is class-imbalanced, batch oversampling is performed to balance the dataset. Batch oversampling can determine the ratio of positive and negative examples in each batch of training sets according to the set sampling ratio. The hyperparameters in the experiment are shown in Table 3.
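Batch oversampling as described above amounts to drawing each training batch with a fixed class ratio, sampling the minority (defective) class with replacement. The sketch below is illustrative, assuming the two class lists are already separated; the function name `oversampled_batch` is ours.

```python
import random

def oversampled_batch(majority, minority, batch_size, ratio):
    """Draw one training batch in which `ratio` of the samples come from
    the majority (defect-free) class; the minority class is drawn with
    replacement, i.e. oversampled."""
    n_major = round(batch_size * ratio)
    n_minor = batch_size - n_major
    batch = random.choices(majority, k=n_major) + random.choices(minority, k=n_minor)
    random.shuffle(batch)
    return batch

# Imbalanced toy dataset: 90 clean files, 10 defective ones
clean = [("clean", i) for i in range(90)]
buggy = [("buggy", i) for i in range(10)]
batch = oversampled_batch(clean, buggy, batch_size=32, ratio=0.5)
```

With ratio 0.5, every batch of 32 contains 16 defective samples even though the underlying dataset is only 10% defective, matching the near-0.5 sampling ratios reported in Tables 4 and 5.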
A total of 6,109 words were used for word embedding. To determine the embedding dimension and window size, dimensions of 16, 32, 64, 128, and 256 were tested, along with window sizes of 2, 3, 4, 5, and 6, resulting in five groups of experiments. According to Fig 6, the loss for the first 1,000 batches is lowest when the embedding dimension is 64. A window size of 2 or 3 achieves better results, with size 3 being somewhat more stable. With these two parameters fixed, experiments over 125 epochs indicate that the model essentially converges after 75 epochs; to be safe, 100 epochs were chosen. For the Transformer, retaining the original optimizer and learning rate yields better results. When the number of encoder and decoder layers exceeds three, the results do not improve; instead, computation time increases. The model converges in no more than 20 epochs during training.
Baseline models
BiLSTM: Dam et al. [19] input AST sequence into LSTM for word embedding, then input the embedding vector into LSTM to extract syntax and semantics, and finally predict through classifier.
N2D: Qu et al. [20] use skip-gram to embed the sequence of CDN traversal, then combine the embedding features with traditional metrics and predict through classifier.
GCN: Zeng et al. [21] use node2vec network embedding and use traditional software code features as attributes of CDN nodes. Finally, CDN is input into GCN to obtain deep features and then classified.
CGCN: Zhou et al. [22] transform source code into AST and CDN. The integer vector of AST is input into CNN to capture semantic information, and the graph convolutional network (GCN) learns the structural information of CDN. The learned deep features are connected with traditional metrics and the classifier is trained to achieve more accurate defect prediction.
Evaluation metrics
Five common evaluation metrics are used to evaluate the performance of the prediction model, including accuracy, precision, recall, F1, and area under the curve (AUC).
TP, TN, FN and FP come from the confusion matrix which is a tool used to evaluate the performance of classification models. Defect samples are considered positive. “T” denotes the prediction is correct and "P" denotes the samples are predicted as positive(defect). “F” denotes the prediction is false and “N” denotes the samples are predicted as negative(non-defect). For example, TP represents the number of actual defect samples predicted as defective.
AUC is the area under the receiver operating characteristic curve (ROC). The x-axis of the ROC is the ratio of non-defective modules classified as defective (False Positive Rate, FPR); the y-axis is Recall. The curve shows the FPR and Recall values under different thresholds, where the threshold is applied to the probability that a sample is predicted as defective. The closer the AUC value is to 1, the better the classifier's discrimination ability.
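These metrics follow directly from the confusion-matrix counts; AUC can equivalently be computed via the rank (Mann-Whitney) formulation rather than by integrating the ROC curve. The sketch below is a plain-Python illustration with our own function names.

```python
def confusion_counts(y_true, y_pred):
    """TP, TN, FP, FN with defective samples treated as positive (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def acc_f1(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    acc = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1

def auc(y_true, scores):
    """AUC via ranks: the probability that a random defective sample
    scores higher than a random clean one (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For instance, with true labels [1, 1, 0, 0] and predictions [1, 0, 0, 0], accuracy is 0.75 and F1 is 2/3, since the one missed defect costs recall but not precision.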
Results
Table 4 shows the results of the cross-version experiment. The first two columns list the training set and the test set, and the last column is the sampling ratio, which indicates the proportion of defect-free samples in each batch. The results show that the average ACC of all cross-version experiments is 0.7, the average F1 is 0.614, and the average AUC is 0.711, achieving good prediction performance. However, camel1.4-1.6 has the lowest F1 and AUC, and the ACC of velocity1.4-1.5 is the lowest among all cross-version pairs, followed by xalan2.4-2.5. The defect rates of these versions are at two extremes, either too high or too low. Oversampling cannot completely solve this problem, because duplicate samples may be drawn, reducing the model's discriminative ability. The average sampling ratio is 0.468, indicating that most projects perform better when the ratio is close to 0.5; balancing the sample classes is conducive to performance improvement.
Table 5 shows the results of the cross-project experiment. The first two columns show the training set and the test set, and the last column shows the ratio of defect-free samples in each batch of the training set. The average ACC is 0.687, the average F1 is 0.575, and the average AUC is 0.696. In all projects, camel has poor results as the target, and velocity has poor results as the training set. This is related to the defect rate of the dataset and the file structure and length of the two projects. This may be due to the purpose and background of the software. Further analysis is made in the comparative experiment.
Cross-version comparison experiment
The cross-version comparison experiment is shown in Fig 7. Compared with the other models, MFA has the highest average ACC, F1, and AUC, showing that it has certain advantages in cross-version scenarios. In the prediction of projects such as ant and jedit, it shows high performance: the ACC of the ant1.4-1.6 prediction reaches 0.809, the F1 of poi2.5-3.0 is 0.824, and the AUC of the jedit4.0-4.1 prediction is 0.832, which shows that the model has high performance and strong recognition ability for defect patterns in complex traversal sequences.
However, the results of camel1.4-1.6, velocity1.4-1.5, and xalan2.4-2.5 performed poorly compared to other models. The reasons are different. There are a large number of small file defects (AST sequence length is less than 50) in velocity1.4 and 1.5. As shown in Fig 8, in deep learning, small file defects are prone to losing details during downsampling, resulting in a decline in model performance. In camel1.4 and xalan2.4, the defect rate is too low and there are fewer defective samples, resulting in a decline in model performance.
Cross-project comparison experiment
The cross-project comparison results are shown in Fig 9. The average ACC, F1, and AUC of MFA all exceed those of the other models, at 0.687, 0.575, and 0.696, respectively. Using all versions of a project as a dataset enriches the file structures and defect types in the dataset and increases the number of cross-project samples, which is important for deep learning. The highest ACC, F1, and AUC are 0.798, 0.810, and 0.8, respectively, indicating that MFA has higher performance.
Among all cross-project predictions, camel's results are the worst, which is related to its dataset. As can be seen from Fig 10, its proportion of small-file (short-sequence) defects is large while its defect rate is small, meaning the quality and quantity of defect samples are insufficient. In feature extraction, since long and short files cannot be accommodated at the same time, small-file features are extracted insufficiently, which in turn affects cross-project performance. When ant and jedit are used as training sets for cross-project prediction, performance is often better, which is related to their lower short-sequence defect ratios: in Fig 10, the defect ratio of files shorter than 50 in ant and jedit is the lowest among all projects. Because small files do not exhibit enough defect patterns, the model learns them poorly.
Discussion
In cross-version experiments, the datasets where MFA does not perform well are velocity, lucene, and poi. In particular, velocity is not the best in any of ACC, F1, and AUC. As mentioned for Fig 8, its defect files mostly consist of small files. Although the padding length of each batch is minimized as much as possible to adapt to varying sample lengths, the longest sequence still reaches 6721. Without truncating long sequences, downsampling must be employed, which inevitably loses some details of short-sequence features before deformable attention is performed. In velocity, more than half of the defect sequences are shorter than 100 tokens, which affects training effectiveness. The advantage shown by CGCN on velocity stems from CNN's tolerance for short sequences.
The advantages of MFA are evident in cross-project tasks. It leads in almost all evaluation metrics and performs stably across projects. This is attributed to the handling of long sequences: the triple traversal sequences of the AST enhance the distinction between files, and deformable attention allows faster convergence with a limited number of samples. The GPU memory used throughout the process does not exceed 4 GB; however, the efficiency of deformable attention is significantly affected by the size of the input matrix. The MFA model has an overall parameter size of 25.4M; among representative AST models, this is more than TBCNN [23] (0.5M) but less than code2seq [24] (61M). Nevertheless, it achieves similar performance, with code2seq reaching a cross-project F1 of 0.592 on large Java datasets.
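Parameter figures such as the 25.4M quoted above are obtained by summing the weights and biases of every layer. The sketch below illustrates the tally for a stack of fully connected layers; the layer sizes are invented for demonstration and do not describe MFA's actual architecture.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one fully connected layer: weights plus optional biases."""
    return d_in * d_out + (d_out if bias else 0)

def count_params(layer_dims):
    """Total parameters of a stack of fully connected layers."""
    return sum(linear_params(a, b) for a, b in zip(layer_dims, layer_dims[1:]))

dims = [300, 512, 512, 2]  # hypothetical embedding -> hidden -> hidden -> output
total = count_params(dims)
print(f"{total / 1e6:.3f}M parameters")  # 0.418M parameters
```

In a framework such as PyTorch the same tally is typically `sum(p.numel() for p in model.parameters())`.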
To demonstrate the roles of semantic embedding and network features, ablation experiments are carried out in the cross-version and cross-project scenarios, and four evaluation metrics are used to characterize model performance, as shown in Fig 11. In addition to the metrics used above, the Matthews Correlation Coefficient (MCC) is included; MCC is more comprehensive and less biased than F1 [25]. MFA is the proposed model, MFA- denotes the multi-feature model without GloVe word embedding, and SFA denotes the model with pure semantic features and no GloVe.
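For reference, MCC is computed from the four confusion-matrix counts; the toy counts below are invented for illustration. For binary labels this formula matches `sklearn.metrics.matthews_corrcoef`.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# toy counts: 40 true positives, 45 true negatives, 5 false positives, 10 false negatives
print(round(mcc(40, 45, 5, 10), 3))  # 0.704
```

Because all four cells of the confusion matrix enter the formula, MCC penalizes models that trade false negatives for false positives in a way F1, which ignores true negatives, does not.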
In both scenarios, the four metrics of MFA are better than those of the other two models. After adding network features and GloVe word embedding, the performance over semantic features alone gradually improves, indicating that both additions are effective. Overall, the average performance of MFA- improves over SFA after network features are added. This shows that network features are a good supplement to semantic features and help improve the stability of overall prediction, though the improvement is limited. On individual datasets, such as jedit cross-version and velocity cross-project, adding network features reduces performance. After adding GloVe, the lower bound of cross-project performance rises more markedly, reflecting its advantage: word embedding provides more semantic information and increases the stability of semantic features.
From the perspective of MCC alone, semantic features are unstable. Fusing them with network features yields visibly more concentrated results, and GloVe further improves the performance of semantic features. The F1 value, however, appears biased in this evaluation: it shows abnormal points in the cross-version setting, and its fluctuation is greater than that of the other metrics. F1 may therefore be unreliable for single-project evaluation, although the average F1 still reflects overall model performance.
Limitation
AST node selection: In the experiments, custom nodes such as function names and operators are removed to ensure the transferability of models across versions and projects. This may lose some AST details and thus weaken the model's ability to capture defective modules. In the future, we will try to retain all AST nodes while simplifying similar ones.
Batch oversampling setting: Because each dataset has different characteristics, its performance under a specific oversampling ratio can only be determined experimentally. Therefore, a step of 0.05 over the range 0.05 to 0.9 is used and 19 experiments are conducted for each dataset, which reduces experimental efficiency. In the future, we will treat the batch oversampling ratio as a learnable variable to avoid unnecessary experiments.
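The per-dataset grid search described above can be sketched as follows. `train_and_score` is a hypothetical stand-in for one full training run at a given oversampling ratio; the real per-run logic belongs to the MFA training pipeline.

```python
# candidate oversampling ratios at a 0.05 step: 0.05, 0.10, ..., 0.90
ratios = [round(0.05 * k, 2) for k in range(1, 19)]

def grid_search(train_and_score):
    """Run one experiment per candidate ratio and return the best-scoring ratio."""
    scores = {r: train_and_score(r) for r in ratios}
    return max(scores, key=scores.get)

# toy scorer peaking at 0.4, standing in for a real training run
best = grid_search(lambda r: -abs(r - 0.4))
print(best)  # 0.4
```

Replacing this exhaustive loop with a learnable sampling ratio, as proposed above, would collapse the per-dataset grid into a single training run.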
Sequence length padding: In the experiments, padding is reduced as much as possible and file lengths are shortened to increase training speed. However, because file lengths vary within each project, all sequences are zero-padded to the longest AST sequence in the training and test sets so that they enter the model with the same length. This consumes considerable memory and causes details of short-sequence files to be lost during feature extraction, which affects model performance to some extent. In the future, we will consider using different models to extract short and long sequences and combining them for a final prediction.
Conclusion
In this research, a software defect prediction model based on the attention mechanism, MFA, is proposed. The model makes full use of the structural and semantic information of the AST and the CDN: after the two features are embedded, they are extracted and fused by attention mechanisms for defect prediction. The experimental results show that the average ACC, F1, and AUC of MFA reach 0.7, 0.614, and 0.711 in the cross-version scheme, and 0.687, 0.575, and 0.696 in the cross-project scheme, outperforming the LSTM model based on a single feature and the GNN model based on multiple features. This indicates that the method achieves better prediction performance by using attention to capture subtle defect patterns in semantic and network features. The ablation experiments also show that network features and semantic features complement each other and jointly improve the performance of software defect prediction. Work worth further study remains, such as fuller representation of source code information, unification of sequence lengths, and cross-language prediction.
References
- 1. Giray G, Bennin KE, Köksal Ö, Babur Ö, Tekinerdogan B. On the use of deep learning in software defect prediction. J Syst Softw. 2023;195:111537.
- 2. Qiu S, E B, He J, Liu L. Survey of software defect prediction features. Neural Comput Applic 2024;37(4):2113–44.
- 3. Abdu A, Zhai Z, Abdo HA, Algabri R. Software defect prediction based on deep representation learning of source code from contextual syntax and semantic graph. IEEE Trans Rel 2024;73(2):820–34.
- 4. Liu H, Li Z, Zhang H, Jing X-Y, Liu J. CFG2AT: Control flow graph and graph attention network-based software defect prediction. IEEE Trans Rel. 2025:1–15. https://doi.org/10.1109/tr.2024.3503688
- 5. Munir HS, Ren S, Mustafa M, Siddique CN, Qayyum S. Attention based GRU-LSTM for software defect prediction. PLoS One 2021;16(3):e5017444. pmid:33661985
- 6. Jiang S, Chen Y, He Z, Shang Y, Ma L. Cross-project defect prediction via semantic and syntactic encoding. Empir Softw Eng. 2024;29(4).
- 7. Nevendra M, Singh P. A survey of software defect prediction based on deep learning. Arch Comput Methods Eng 2022;29(7):5723–48.
- 8. Meilong S, He P, Xiao H, Li H, Zeng C. An approach to semantic and structural features learning for software defect prediction. Math Probl Eng. 2020;2020:1–13.
- 9. Wang H, Zhuang W, Zhang X. Software defect prediction based on gated hierarchical LSTMs. IEEE Trans Rel 2021;70(2):711–27.
- 10. Tao C, Wang T, Guo H, Zhang J. An approach to software defect prediction combining semantic features and code changes. Int J Soft Eng Knowl Eng 2022;32(9):1345–68.
- 11. Cui M, Long S, Jiang Y, Na X. Research of software defect prediction model based on complex network and graph neural network. Entropy (Basel) 2022;24(10):1373. pmid:37420393
- 12. Yu T, Huang C, Fang N. Use of deep learning model with attention mechanism for software fault prediction. In: 2021 8th international conference on dependable systems and their applications (DSA); 2021. p. 161–71.
- 13. Zhao Z, Yang B, Li G, Liu H, Jin Z. Precise learning of source code contextual semantics via hierarchical dependence structure and graph attention networks. J Syst Softw. 2022;184:111108.
- 14. Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP); 2014. p. 1532–43.
- 15. Qu Y, Yin H. Evaluating network embedding techniques’ performances in software bug prediction. Empir Softw Eng 2021;26(4):60.
- 16. Zhang J, Dong Y, Wang Y, Tang J, Ding M. ProNE: Fast and scalable network representation learning. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence; 2019. p. 4278–84.
- 17. Zhu X, Su W, Lu L, Li B, Wang X, Dai J. Deformable DETR: Deformable transformers for end-to-end object detection. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 1–12.
- 18. Woo S, Park J, Lee JY, Kweon IS. CBAM: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 3–19.
- 19. Dam H, Tran T, Pham T, Ng S, Grundy J, Ghose A. Automatic feature learning for predicting vulnerable software components. IEEE Trans Softw Eng 2021;47(1):67–85.
- 20. Qu Y, Liu T, Chi J, Jin Y, Cui D, He A. node2defect: using network embedding to improve software defect prediction. In: Proceedings of the 33rd ACM/IEEE international conference on automated software engineering; 2018. p. 844–9.
- 21. Zeng C, Zhou C, Lv S, He P, Huang J. GCN2defect: Graph convolutional networks for SMOTETomek-based software defect prediction. In: 2021 IEEE 32nd international symposium on software reliability engineering (ISSRE); 2021. p. 69–79.
- 22. Zhou C, He P, Zeng C, Ma J. Software defect prediction with semantic and structural information of codes based on Graph Neural Networks. Inform Softw Technol. 2022;152:107057.
- 23. Mou L, Li G, Zhang L, Wang T, Jin Z. Convolutional neural networks over tree structures for programming language processing. Available from: http://arxiv.org/abs/1409.5718; 2015.
- 24. Alon U, Brody S, Levy O, Yahav E. code2seq: Generating sequences from structured representations of code. Available from: http://arxiv.org/abs/1808.01400; 2019.
- 25. Yao J, Shepperd M. The impact of using biased performance metrics on software defect prediction research. Inform Softw Technol. 2021;139:106664.