Fig 1.
Overview of the methodology, showing the section numbers in which results are reported.
Fig 2.
Overview of the DroidDissector feature extraction tool used in this study.
Table 1.
Comparison of sample and feature counts in Peiravian and Zhu's dataset and ours, showing the difference in dataset size and the increase in feature set size.
Table 2.
Result comparison between Peiravian and Zhu [16] and our reimplementation, showing the effect of feature type (Perm: permissions alone, API: API calls alone, Combined: permissions and API calls) and ML model choice on performance metrics.
For this and subsequent results tables, the figures on the left are reproduced from the original publication and those on the right are from our reimplementation. A missing value (-) on the left indicates that the metric was not reported in the original study.
Table 3.
Comparison of sample counts and the total numbers of normal and dangerous permissions in Wang et al.'s dataset and ours.
Table 4.
Result comparison between Wang et al. [17] and our reimplementation, showing the effect of feature selection algorithm choice.
The original study used SVM; for our reimplementation, we also show the best-performing model in each case. Accuracy and F1-scores for the original study are approximate, as they were read from plots provided by the authors.
Fig 3.
Relationship between number of permissions used and model accuracy when permissions are ranked using different methods.
Table 5.
Result comparison between Rathore et al. [18] and our reimplementation, showing the effect of model choice for permissions-based models, and the most effective feature selection method for reducing the number of permissions for each ML model type.
Fig 4.
Effect of the variance threshold on the accuracy of random forest models, also showing the number of permissions selected at each threshold.
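For context, variance-threshold selection simply discards near-constant features. A minimal sketch using scikit-learn's VarianceThreshold follows; the permission matrix and the threshold value 0.01 are illustrative assumptions, not values from the study.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical binary permission matrix: one row per app, one column per permission.
rng = np.random.default_rng(0)
X_perms = rng.integers(0, 2, size=(1000, 200))
X_perms[:, :50] = 0  # some permissions are never requested -> zero variance

# Drop permissions whose variance falls below the threshold; columns that are
# almost always 0 or almost always 1 carry little discriminative signal.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_perms)
print(f"{X_perms.shape[1]} -> {X_reduced.shape[1]} permissions retained")
```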
Table 6.
Result comparison between Sahin et al. [19] and our reimplementation, showing the effect of model choice when permissions-based features are selected using a model-based feature selection algorithm.
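To illustrate the model-based feature selection referred to in Table 6, here is a minimal sketch using scikit-learn's SelectFromModel with a random forest as the importance estimator; the data and choice of estimator are assumptions for illustration, not necessarily the configuration used by Sahin et al.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 200))  # hypothetical permission features
y = rng.integers(0, 2, size=1000)         # malware (1) / benign (0) labels

# Fit the estimator, then keep only features whose importance exceeds the
# mean importance (SelectFromModel's default threshold).
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_selected = selector.fit_transform(X, y)
print(f"{X.shape[1]} -> {X_selected.shape[1]} features selected")
```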
Table 7.
Result comparison between Ma et al. [22] and our reimplementation, comparing the four different API call modelling approaches used in the original study.
Table 8.
Effect of varying the number of layers in DNN models trained on the API usage, API frequency and API sequence feature sets.
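As a sketch of how such a depth sweep might be run, the snippet below varies the number of hidden layers in scikit-learn's MLPClassifier; the layer width, depth range, and synthetic data are assumptions, and the original study's DNN architecture may differ.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 100)).astype(float)  # e.g. API usage features
y = rng.integers(0, 2, size=500)

# Sweep the number of hidden layers while holding the width fixed at 64 units.
for depth in (1, 2, 3, 4):
    model = MLPClassifier(hidden_layer_sizes=(64,) * depth,
                          max_iter=500, random_state=0)
    acc = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    print(f"{depth} hidden layer(s): mean accuracy = {acc:.3f}")
```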
Table 9.
Comparison of different ML models using API usage and frequency feature sets.
Table 10.
Result comparison between Jung et al. [25] and our reimplementation, showing the performance of models trained using only the top 50 API calls found in malware and the top 50 found in benign software.
Table 11.
Result comparison between Muzaffar et al. [26] and our reimplementation, showing the relative performance of different ML models trained using the full API usage feature set.
Fig 5.
Relationship between number of API calls used and model accuracy for API usage features ranked using various methods.
Table 12.
Result comparison between Muzaffar et al. [26] and our reimplementation, showing the relative performance of random forest models trained using API usage feature sets reduced using different feature selection algorithms.
For comparison, the corresponding results without feature selection (from Table 11) are also shown.
Fig 6.
Number of API usage features selected at different variance thresholds.
Table 13.
Comparison of samples and Drebin feature counts in Arp et al.'s original dataset and ours, also showing the accuracy of SVM models trained on these two datasets.
Table 14.
Comparison of models trained on reduced Drebin feature sets, using different feature selection algorithms, showing results for the most effective ML model in each case.
Table 15.
Number of unique n-opcodes for different values of n.
Table 16.
Number of selected n-opcodes using mutual information.
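Tables 15 and 16 concern n-opcode (opcode n-gram) features; the sketch below shows one plausible way to extract 2-opcode counts and rank them by mutual information using scikit-learn. The opcode sequences and the value of k are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical disassembled opcode sequences, one space-separated string per app.
docs = [
    "invoke-virtual move-result return-void",
    "const-string invoke-static move-result-object return-object",
    "invoke-virtual move-result if-eqz return-void",
    "const-string invoke-virtual move-result return-void",
]
labels = [1, 0, 1, 0]  # malware (1) / benign (0)

# Count n-opcodes for n = 2: each feature is a contiguous pair of opcodes.
vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"\S+")
X = vectorizer.fit_transform(docs)

# Keep the k n-opcodes with the highest mutual information with the label.
selector = SelectKBest(mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, labels)
print(f"{X.shape[1]} -> {X_selected.shape[1]} n-opcodes selected")
```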
Fig 7.
Accuracy of ML models trained on usage and frequency-based n-opcode features selected using mutual information, showing the effect of changing the value of n.
Table 17.
Result comparison between Kang et al. [27] and our reimplementation, showing the effect of ML model choice when using usage and frequency-based n-opcode features, and also showing the best experimentally derived value of n for each model.
F1-scores for the original study are approximate, as they were read from the authors' plots.
Table 18.
Result comparison between Xiao and Yang’s [29] CNN model trained on image-based opcode features and our reimplementation.
Note that the authors of the original study reported separate precision and TPR figures for malware and for benign software; we report both for completeness.
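For readers unfamiliar with image-based opcode features, the sketch below shows one plausible encoding: opcode bytes reshaped into a fixed-size grayscale image and fed to a small Keras CNN. The image size and architecture are assumptions and do not reproduce Xiao and Yang's network.

```python
import numpy as np
import tensorflow as tf

def opcodes_to_image(opcode_bytes, side=32):
    """Pad/truncate a byte sequence and reshape it into a grayscale image."""
    img = np.zeros(side * side, dtype=np.float32)
    seq = np.frombuffer(bytes(opcode_bytes[: side * side]), dtype=np.uint8)
    img[: len(seq)] = seq
    return img.reshape(side, side, 1) / 255.0

# A small binary-classification CNN over the opcode images.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```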
Table 19.
Result comparison between Yeboah and Baz Musah’s [28] 1D CNN model trained on sequence-based opcode features and our reimplementation.
Table 20.
Result comparison between Ananya et al. [36] and our reimplementation, showing the effect of ML model choice when trained using system call features represented as unigrams, bigrams and trigrams.
The most effective feature selection method is shown in each case.
Fig 8.
Mean accuracy of ML models trained on system call features represented as unigrams, bigrams and trigrams when using mutual information and chi-square to select features, showing the effect of varying the feature count n.
Table 21.
Result comparison between Malik et al. [35] and our reimplementation, showing the performance of kNN models trained on system call frequency features and LSTM models trained on system call sequences.
Table 22.
Comparison of core ML models trained on usage and frequency-based system call features.
Table 23.
Result comparison between Afonso et al. [34] and our reimplementation, showing performance of ML models trained on a combined system and API call feature set.
Table 24.
Performance of RF models trained only on API calls, comparing the benefit of usage-based and frequency-based features.
Table 25.
Result comparison between Zulkifli et al. [37] and our reimplementation, showing the performance of ML models trained using their TCP-based network traffic feature set.
Table 26.
Result comparison between Wang et al. [38] and our reimplementation, showing the performance of ML models trained using their TCP and HTTP-based network traffic feature sets.
For our reimplementation, we only show results of the best ML model for each feature set.
Table 27.
Performance of all core ML models trained on HTTP features or a combination of both TCP and HTTP features.
Note that models trained only on TCP features are shown in Table 25.
Table 28.
Result comparison between Kandukuru and Sharma [39] and our reimplementation, showing the performance of ML models trained on dynamic network traffic features and static permissions.
Fig 9.
Effect on ML model accuracy of reducing the number of system call and permission features using mutual information.
Table 29.
Reimplementation of Kapratwar et al.’s [41] approach, in which models are trained on static permissions and dynamic system call features.
Table 30.
Reimplementation of Kapratwar et al.’s [41] approach using reduced feature sets selected using mutual information.
Table 31.
Overall best-performing models, in each case showing the features, classifier model, and feature selection algorithm used.
Table 32.
Performance of ensemble models, showing the top five voting ensemble models formed from the overall best-performing base classifier models (Cn) listed in Table 31.
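A minimal sketch of a voting ensemble of this kind, using scikit-learn's VotingClassifier, is shown below; the three base classifiers are placeholders standing in for the best-performing models Cn of Table 31.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 50)).astype(float)  # hypothetical features
y = rng.integers(0, 2, size=500)

# Placeholder base classifiers standing in for C1, C2, C3.
ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0)),
                ("knn", KNeighborsClassifier())],
    voting="soft",  # average the base models' predicted probabilities
)
ensemble.fit(X, y)
print(f"training accuracy: {ensemble.score(X, y):.3f}")
```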
Table 33.
Summary of results, comparing originally published accuracies, F1-scores and TPRs (Original) against those of our reimplementations (Ours) and ensemble models.
Where multiple models were evaluated in a study, only the best result is shown for each metric; where a metric was not reported in the original study, this is indicated with —. Bold highlighting indicates whether the original study or our reimplementation produced the better result for each metric, and the overall best value for each metric is underlined.