Fig 1.
Overview of the methodology, showing the section numbers in which results are reported.
Fig 2.
Overview of the DroidDissector feature extraction tool used in this study.
Table 1.
Comparison of sample and feature counts in Peiravian and Zhu's dataset and ours, showing the difference in dataset size and the increase in feature set size.
Table 2.
Result comparison between Peiravian and Zhu [16] and our reimplementation, showing the effect of feature type (Perm: permissions alone, API: API calls alone, Combined: permissions and API calls) and ML model choice on performance metrics.
For this and subsequent results tables, the figures on the left are reproduced from the original publication and those on the right are from our reimplementation. A missing value (-) on the left indicates that the metric was not reported in the original study.
Table 3.
Comparison of sample counts and the total numbers of normal and dangerous permissions in Wang et al.'s dataset and ours.
Table 4.
Result comparison between Wang et al. [17] and our reimplementation, showing the effect of feature selection algorithm choice.
The original study used SVM; for our reimplementation, we also show the best-performing model in each case. Accuracy and F1-scores for the original study are approximate, as they were read from plots provided by the authors.
Fig 3.
Relationship between number of permissions used and model accuracy when permissions are ranked using different methods.
Table 5.
Result comparison between Rathore et al. [18] and our reimplementation, showing the effect of model choice for permissions-based models, and the most effective feature selection method for reducing the number of permissions for each ML model type.
Fig 4.
Effect of the variance threshold on the accuracy of random forest models, also showing the number of permissions selected at each threshold.
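For context, variance-threshold selection simply discards near-constant features. A minimal sketch using scikit-learn's VarianceThreshold follows; the permission matrix and the threshold value 0.01 are illustrative assumptions, not values from the study.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical binary permission matrix: one row per app, one column per permission.
rng = np.random.default_rng(0)
X_perms = rng.integers(0, 2, size=(1000, 200))
X_perms[:, :50] = 0  # some permissions are never requested -> zero variance

# Drop permissions whose variance falls below the threshold; columns that are
# almost always 0 or almost always 1 carry little discriminative signal.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_perms)
print(f"{X_perms.shape[1]} -> {X_reduced.shape[1]} permissions retained")
```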
Table 6.
Result comparison between Sahin et al. [19] and our reimplementation, showing the effect of model choice when permissions-based features are selected using a model-based feature selection algorithm.
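To illustrate the model-based feature selection referred to in Table 6, here is a minimal sketch using scikit-learn's SelectFromModel with a random forest as the importance estimator; the data and choice of estimator are assumptions for illustration, not necessarily the configuration used by Sahin et al.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 200))  # hypothetical permission features
y = rng.integers(0, 2, size=1000)         # malware (1) / benign (0) labels

# Fit the estimator, then keep only features whose importance exceeds the
# mean importance (SelectFromModel's default threshold).
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_selected = selector.fit_transform(X, y)
print(f"{X.shape[1]} -> {X_selected.shape[1]} features selected")
```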
Table 7.
Result comparison between Ma et al. [22] and our reimplementation, comparing the four different API call modelling approaches used in the original study.
Table 8.
Effect of varying the number of layers in DNN models trained on the API usage, API frequency and API sequence feature sets.
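As a sketch of how such a depth sweep might be run, the snippet below varies the number of hidden layers in scikit-learn's MLPClassifier; the layer width, depth range, and synthetic data are assumptions, and the original study's DNN architecture may differ.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 100)).astype(float)  # e.g. API usage features
y = rng.integers(0, 2, size=500)

# Sweep the number of hidden layers while holding the width fixed at 64 units.
for depth in (1, 2, 3, 4):
    model = MLPClassifier(hidden_layer_sizes=(64,) * depth,
                          max_iter=500, random_state=0)
    acc = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    print(f"{depth} hidden layer(s): mean accuracy = {acc:.3f}")
```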
Table 9.
Comparison of different ML models using API usage and frequency feature sets.
Table 10.
Result comparison between Jung et al. [25] and our reimplementation, showing the performance of models trained using only the top 50 API calls found in malware and the top 50 found in benign software.
Table 11.
Result comparison between Muzaffar et al. [26] and our reimplementation, showing the relative performance of different ML models trained using the full API usage feature set.
Fig 5.
Relationship between number of API calls used and model accuracy for API usage features ranked using various methods.
Table 12.
Result comparison between Muzaffar et al. [26] and our reimplementation, showing the relative performance of random forest models trained using API usage feature sets reduced using different feature selection algorithms.
For comparison, the corresponding results without feature selection (from Table 11) are also shown.
Fig 6.
Number of API usage features selected at different variance thresholds.
Table 13.
Comparison of samples and Drebin feature counts in Arp et al.'s original dataset and ours, also showing the accuracy of SVM models trained on these two datasets.
Table 14.
Comparison of models trained on reduced Drebin feature sets, using different feature selection algorithms, showing results for the most effective ML model in each case.
Table 15.
Number of unique n-opcodes for different values of n.
Table 16.
Number of selected n-opcodes using mutual information.
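Tables 15 and 16 concern n-opcode (opcode n-gram) features; the sketch below shows one plausible way to extract 2-opcode counts and rank them by mutual information using scikit-learn. The opcode sequences and the value of k are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical disassembled opcode sequences, one space-separated string per app.
docs = [
    "invoke-virtual move-result return-void",
    "const-string invoke-static move-result-object return-object",
    "invoke-virtual move-result if-eqz return-void",
    "const-string invoke-virtual move-result return-void",
]
labels = [1, 0, 1, 0]  # malware (1) / benign (0)

# Count n-opcodes for n = 2: each feature is a contiguous pair of opcodes.
vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"\S+")
X = vectorizer.fit_transform(docs)

# Keep the k n-opcodes with the highest mutual information with the label.
selector = SelectKBest(mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, labels)
print(f"{X.shape[1]} -> {X_selected.shape[1]} n-opcodes selected")
```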
Fig 7.
Accuracy of ML models trained on usage and frequency-based n-opcode features selected using mutual information, showing the effect of changing the value of n.
Table 17.
Result comparison between Kang et al. [27] and our reimplementation, showing the effect of ML model choice when using usage and frequency-based n-opcode features, and also showing the best experimentally derived value of n for each model.
F1-scores for the original study are approximate, as they were read from the authors' plots.
Table 18.
Result comparison between Xiao and Yang’s [29] CNN model trained on image-based opcode features and our reimplementation.
Note that the authors of the original study reported separate precision and TPR figures for malware and for benign software; we report both for completeness.
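For readers unfamiliar with image-based opcode features, the sketch below shows one plausible encoding: opcode bytes reshaped into a fixed-size grayscale image and fed to a small Keras CNN. The image size and architecture are assumptions and do not reproduce Xiao and Yang's network.

```python
import numpy as np
import tensorflow as tf

def opcodes_to_image(opcode_bytes, side=32):
    """Pad/truncate a byte sequence and reshape it into a grayscale image."""
    img = np.zeros(side * side, dtype=np.float32)
    seq = np.frombuffer(bytes(opcode_bytes[: side * side]), dtype=np.uint8)
    img[: len(seq)] = seq
    return img.reshape(side, side, 1) / 255.0

# A small binary-classification CNN over the opcode images.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```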
Table 19.
Result comparison between Yeboah and Baz Musah’s [28] 1D CNN model trained on sequence-based opcode features and our reimplementation.
Table 20.
Result comparison between Ananya et al. [36] and our reimplementation, showing the effect of ML model choice when trained using system call features represented as unigrams, bigrams and trigrams.
The most effective feature selection method is shown in each case.
Fig 8.
Mean accuracy of ML models trained on system call features represented as unigrams, bigrams and trigrams when using mutual information and chi-square to select features, showing the effect of varying the feature count n.
Table 21.
Result comparison between Malik et al. [35] and our reimplementation, showing the performance of kNN models trained on system call frequency features and LSTM models trained on system call sequences.
Table 22.
Comparison of core ML models trained on usage and frequency-based system call features.
Table 23.
Result comparison between Afonso et al. [34] and our reimplementation, showing performance of ML models trained on a combined system and API call feature set.
Table 24.
Performance of RF models trained only on API calls, comparing the benefit of usage-based and frequency-based features.
Table 25.
Result comparison between Zulkifli et al. [37] and our reimplementation, showing the performance of ML models trained using their TCP-based network traffic feature set.
Table 26.
Result comparison between Wang et al. [38] and our reimplementation, showing the performance of ML models trained using their TCP and HTTP-based network traffic feature sets.
For our reimplementation, we only show results of the best ML model for each feature set.
Table 27.
Performance of all core ML models trained on HTTP features or a combination of both TCP and HTTP features.
Note that models trained only on TCP features are shown in Table 25.
Table 28.
Result comparison between Kandukuru and Sharma [39] and our reimplementation, showing the performance of ML models trained on dynamic network traffic features and static permissions.
Fig 9.
Effect on ML model accuracy of reducing the number of system call and permission features using mutual information.
Table 29.
Reimplementation of Kapratwar et al.’s [41] approach, in which models are trained on static permissions and dynamic system call features.
Table 30.
Reimplementation of Kapratwar et al.’s [41] approach using reduced feature sets selected using mutual information.
Table 31.
Overall best-performing models, in each case showing the features, classifier model, and feature selection algorithm used.
Table 32.
Performance of ensemble models, showing the top five voting ensemble models formed from the overall best-performing base classifier models (Cn) listed in Table 31.
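A minimal sketch of a voting ensemble of this kind, using scikit-learn's VotingClassifier, is shown below; the three base classifiers are placeholders standing in for the best-performing models Cn of Table 31.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 50)).astype(float)  # hypothetical features
y = rng.integers(0, 2, size=500)

# Placeholder base classifiers standing in for C1, C2, C3.
ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0)),
                ("knn", KNeighborsClassifier())],
    voting="soft",  # average the base models' predicted probabilities
)
ensemble.fit(X, y)
print(f"training accuracy: {ensemble.score(X, y):.3f}")
```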
Table 33.
Summary of results, comparing originally published accuracies, F1-scores and TPRs (Original) against those of our reimplementations (Ours) and ensemble models.
Where multiple models were evaluated in a study, only the best result is shown for each metric; where a metric was not reported in the original study, this is indicated with —. Bold highlighting indicates whether the original study or our reimplementation produced the better result for each metric, and the overall best value for each metric is underlined.