Abstract
Anomaly detection plays a crucial role in fields such as information security and industrial production. It relies on the identification of rare instances that deviate significantly from expected patterns. Reliance on a single model can introduce uncertainty, as it may not adequately capture the complexity and variability inherent in real-world datasets. Under the framework of model averaging, this paper proposes a criterion for the selection of weights in the aggregation of multiple models, employing a focal loss function with Mallows’ form to assign weights to the base models. This strategy is integrated into a random forest algorithm by replacing the conventional voting method. Empirical evaluations conducted on multiple benchmark datasets demonstrate that the proposed method outperforms classical anomaly detection algorithms while surpassing conventional model averaging techniques based on minimizing standard loss functions. These results highlight a notable enhancement in both accuracy and robustness, indicating that model averaging methods can effectively mitigate the challenges posed by data imbalance.
Citation: Zhao G, Wang L, Wang X (2025) A Mallows-like criterion for anomaly detection with random forest implementation. PLoS One 20(6): e0323333. https://doi.org/10.1371/journal.pone.0323333
Editor: Jie Zhang, Newcastle University, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: November 14, 2024; Accepted: April 6, 2025; Published: June 6, 2025
Copyright: © 2025 Zhao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All datasets are available at https://zenodo.org/records/15226740 (DOI: https://doi.org/10.5281/zenodo.15226740).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Anomaly Detection (AD) [1–3] aims to identify anomalies from observed data. Its applications include financial analysis [4], cybersecurity [5], and paramedical care [6]. Traditional anomaly detection methods, such as the Isolation Forest (IF), the Local Outlier Factor (LOF), and Gaussian Mixture Models (GMMs), often assume that anomalies are outliers or low-probability points, and they distinguish anomalies via attributes based on statistical properties, distance, and density [7].
These assumptions are subject to two types of limitations. First, the inherent uncertainty associated with any single model can cause fitting issues. Second, each method targets a single type of anomaly, which can be inefficient, especially when the true anomaly type differs significantly from the one the method assumes.
Ensemble methods can account for different types of data and models, but conventional voting or averaging approaches may prove inefficient for extremely imbalanced data. Model averaging methods [8–10] have been shown to improve overall prediction performance by combining the predictions of multiple base models. Such methods are typically based on minimizing a loss function [11, 12], and some are Bayesian [13]. Focal loss variational autoencoders have been combined with XGBoost, achieving promising results on imbalanced network traffic datasets; related work has addressed feature selection for intrusion detection, as well as deep learning approaches and lightweight designs tailored for intrusion detection systems [14–17]. Some researchers have intensively investigated model averaging for high-dimensional regression, removing weight constraints or handling responses missing at random, while others have improved Bayesian model averaging, for instance by combining a Bayesian model with the selection of regressors to enhance the interpretability and efficiency of Bayesian model averaging estimation. Model averaging has also been explored in deep learning, with an efficient protocol for decentralized training of deep neural networks from distributed data sources that achieved good results on several tasks. In model averaging methods based on minimizing a loss function, the loss is usually the logarithmic loss, the squared loss, or the cross-entropy loss. The focal loss function [18] addresses class imbalance, particularly in object detection and image segmentation.
It can assign different weights to samples or classes, but its application to model averaging, specifically in assigning weights to base models, has not been explored. Model averaging using Mallows’ criterion is known for its asymptotic optimality in linear regression problems and has been extended to some machine learning models, such as the random forest. However, model averaging has primarily focused on regression problems, with the goal of predicting continuous outcomes. While this has improved prediction accuracy and stability in regression tasks, it has seen little application to classification problems. This is problematic for datasets that show significant class imbalance. We aim to extend model averaging methods to better meet these challenges. Specifically, we adapt the Mallows criterion by substituting the conventional cross-entropy loss function with a focal loss function. This enables our ensemble of submodels to learn more effectively from minority classes without compromising performance on majority classes.
We propose optimizing the weights in model averaging by integrating a focal loss function into the Mallows criterion. Specifically, within a random forest framework, we introduce a complexity penalty term to the focal loss function, akin to Mallows averaging [19], and determine the weights for sub-decision trees by minimizing a Mallows-like criterion. This approach enhances performance on highly imbalanced datasets through the use of the focal loss function, which improves anomaly detection accuracy. It also controls model complexity and boosts generalization by incorporating a regularization term into the loss function and leverages model averaging to amalgamate the strengths and performance of various base models, assigning them distinct weights. Consequently, this method facilitates the development of a more precise anomaly detection model.
The proposed Mallows-like focal loss approach is compared with anomaly detection methods based on minimizing other loss functions, as well as with commonly employed anomaly detection methodologies. The methodology is evaluated using the AUC to assess binary classification performance, the ARI to assess clustering performance, and recall to assess the percentage of outliers that are detected. On the KDDCup network intrusion dataset, the proposed approach improves both the F1-score and the recall over the suboptimal model averaging method based on minimizing the cross-entropy loss function, and it also outperforms several common anomaly detection methods. On public benchmark datasets, results indicate improved accuracy and stability in anomaly detection and in extremely imbalanced data classification.
The main contributions of this paper are summarized as follows:
- We propose a Mallows-like averaging criterion to optimize the weights in the aggregation of multiple models; in particular, the focal loss function is instrumental in enhancing the performance of anomaly detection;
- Utilizing Mallows-like focal loss (MFL), we introduce a variant of the random forest algorithm, tailored for anomaly detection, within the framework of model averaging for optimal weight selection.
2 Proposed method
We consider anomaly detection as a binary supervised classification problem. We summarize the training sample as a set $\{(X_i, Y_i)\}_{i=1}^{n}$, where $X_i$ is a vector of predictors of dimension $p$; the response variable $Y_i = 1$, $i = 1, \ldots, n$, indicates that the $i$-th sample is abnormal, and otherwise its value is zero. The relationships between all variables can be formulated as

$$Y_i = f(X_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where $f$ is unspecified, or even nonparametric. We suppose that all the residual errors $\varepsilon_i$ are independent and homogeneous, with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$.
Notice that in Model (1), the $p$ predictors can be randomly selected, thereby generating diverse models. Model averaging involves the weighted ensemble of the models corresponding to each variable selection, i.e.,

$$\hat{f}(w) = \sum_{m=1}^{M} w_m \hat{f}_m, \qquad (2)$$

followed by the minimization of a penalized loss function to optimize the selection of optimal weights within the unit simplex,

$$\mathcal{W} = \Big\{ w \in [0, 1]^M : \sum_{m=1}^{M} w_m = 1 \Big\}. \qquad (3)$$
The focal loss was initially designed to address object detection with extremely imbalanced data, adding a modulating factor to the standard cross-entropy criterion,

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), \qquad (4)$$

where $p_t$ is the estimated probability for the class with label $Y = 1$. Incorporating focal loss into the ensemble method, we propose a Mallows-like criterion to determine the optimal weights. This Mallows-like criterion is realized through a random forest algorithm. Fig 1 shows the framework of the proposed method, which is a model averaging structure based on minimizing the MFL criterion.
The method minimizes the MFL criterion to allocate weights to the base decision trees, mitigating the effects of data imbalance while controlling model complexity.
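As a concrete illustration, the focal loss of (4) can be evaluated as follows (a minimal sketch; the vectorized form and the function name are ours, not part of the paper's implementation):

```python
import numpy as np

def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-12):
    """Mean focal loss for binary labels y in {0, 1} and predicted
    probabilities p = P(Y = 1).  alpha weights the positive class and the
    modulating factor (1 - p_t)**gamma down-weights easy examples."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

With $\gamma = 0$ and $\alpha = 0.5$, the expression reduces to half the standard cross-entropy, which provides a convenient sanity check.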
2.1 Mallows-like focal loss criterion
Hansen first investigated the Mallows criterion in least squares model averaging, selecting the weight vector as

$$\hat{w} = \arg\min_{w \in \mathcal{W}} \; \Big\| Y - \sum_{m=1}^{M} w_m P_m Y \Big\|^2 + 2 \sigma^2 \sum_{m=1}^{M} w_m k_m, \qquad (5)$$

where $\sigma^2$ is an unknown parameter to estimate, and $P_m$ is the projection matrix in linear regression for the $m$-th model. The term $\sum_{m=1}^{M} w_m k_m$ is defined as the effective number of parameters, i.e., the weighted average of the number of predictors in each submodel. The optimized function has two terms, the first measuring the fitting error of the weighted model and the second penalizing model complexity.
By substituting the first term in the Mallows criterion with the focal loss (4), we obtain a criterion for anomaly detection,

$$\mathrm{MFL}(w) = \sum_{i=1}^{n} \mathrm{FL}\big(p_t(X_i; w)\big) + 2 \sigma^2 \sum_{m=1}^{M} w_m k_m, \qquad (6)$$

where $k_m$ is the number of predictors in the $m$-th base model. Depending on the implemented algorithm, $k_m$ can be relaxed to a function of the number of predictors in each base model.
Considering the random forest method and Least Squares Support Vector Classification (LSSVC): in the random forest, $k_m$ is the number of internal nodes within the $m$-th trained decision tree, while in LSSVC it measures the magnitude of the support vector weights. For the $m$-th classifier, $k_m$ is defined through the kernel matrix $H$, which represents the two-dimensional mapping of the support vectors relative to the entire sample set via the kernel trick, together with a parameter that indicates the strength of regularization.
Note that the unknown parameter $\sigma^2$ in (6) represents the variance of the model in the Mallows criterion, and in practice it is replaced by an estimate $\hat{\sigma}^2$.
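Assembling the pieces of this subsection, the criterion in (6) can be evaluated for a candidate weight vector as below (a sketch under our own notational assumptions: `preds` holds each base model's predicted probabilities as rows, `k` the per-model complexity counts, and `sigma2` the variance estimate):

```python
import numpy as np

def mfl_criterion(w, y, preds, k, sigma2, alpha=0.25, gamma=2.0, eps=1e-12):
    """Mallows-like focal loss: summed focal loss of the weighted ensemble
    plus the complexity penalty 2 * sigma2 * sum_m w_m * k_m."""
    p = np.clip(preds.T @ w, eps, 1 - eps)        # ensemble probability per sample
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    fl = np.sum(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
    return float(fl + 2.0 * sigma2 * np.dot(w, k))
```

The penalty term is linear in the weights, so a heavier (deeper) tree enters the ensemble only when it reduces the focal loss enough to offset its complexity.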
2.2 Random forest with MFL criterion
The random forest is a popular classification method due to its flexibility and accuracy, and the voting mechanism is frequently utilized for data classification. By contrast, the first term in the MFL criterion (6) measures the fitting error of the weighted random forest on the training sample. The second term in (6) penalizes the complexity of the trees in the forest, where $\sum_{m=1}^{M} w_m k_m$ is the weighted number of leaf nodes of all trees.
Utilizing the MFL criterion (6), we realize this algorithm as follows. We establish $M$ decision trees, $\hat{f}_1, \ldots, \hat{f}_M$, and apply the Mallows-like criterion to optimize the weight vector, denoted by $\hat{w} = (\hat{w}_1, \ldots, \hat{w}_M)^{\top}$. The weighted base models are linearly combined to form the overall model. Algorithm 1 shows the steps of the model averaging method.
Algorithm 1. Random forest with Mallows-like focal loss criterion.
There are two hyperparameters, $\alpha$ and $\gamma$, in the focal loss function (4), and there are $M$ subtrees in the random forest algorithm. We adopt Bayesian hyperparameter optimization [20] to expedite training and improve the results. As minimizing the modified focal loss is a nonlinearly constrained optimization problem, we employ sequential least squares [21] to optimize it, thereby controlling complexity and computational cost.
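A sketch of the weight-optimization step, using SciPy's SLSQP solver for the simplex-constrained minimization described above (the objective below inlines a focal loss ensemble criterion; names and default values are our assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import minimize

def optimize_weights(y, preds, k, sigma2=0.0, alpha=0.25, gamma=2.0):
    """Minimize a Mallows-like focal loss over the unit simplex with SLSQP.
    preds: (M, n) matrix of base-model probabilities; k: per-model complexity."""
    M, eps = preds.shape[0], 1e-12

    def mfl(w):
        p = np.clip(preds.T @ w, eps, 1 - eps)
        p_t = np.where(y == 1, p, 1 - p)
        a_t = np.where(y == 1, alpha, 1 - alpha)
        return np.sum(-a_t * (1 - p_t) ** gamma * np.log(p_t)) + 2 * sigma2 * (w @ k)

    res = minimize(mfl, np.full(M, 1.0 / M), method="SLSQP",
                   bounds=[(0.0, 1.0)] * M,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x
```

Starting from equal weights, the solver shifts mass toward trees whose predictions reduce the focal loss, subject to non-negativity and the sum-to-one constraint.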
3 Experiments
3.1 Main result
Anomaly detection typically addresses the issue of severe imbalance, where there is often no clear definition regarding the proportions of positive and negative samples. To expedite training and improve the positive-to-negative sample ratio for validating model effectiveness in anomaly detection, we employed simple random sampling for the minority class while controlling the ratio of positive to negative samples to be 0.05.
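The ratio-controlling step can be sketched as follows (our own minimal implementation; it keeps all negatives and randomly retains positives until the positive-to-negative ratio matches the target):

```python
import numpy as np

def subsample_to_ratio(X, y, ratio=0.05, seed=0):
    """Simple random sampling that keeps all negatives (y == 0) and retains a
    random subset of positives so that #positives / #negatives == ratio."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_pos = min(len(pos), int(round(ratio * len(neg))))
    keep = np.concatenate([rng.choice(pos, size=n_pos, replace=False), neg])
    rng.shuffle(keep)
    return X[keep], y[keep]
```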
We first consider the publicly available KDDCup network intrusion dataset, widely used in anomaly detection, whose 125,870 records each have 41 features. We proportionally extracted 1,089 records, with anomalies held to the controlled minority proportion. Bayesian hyperparameter optimization was employed, with maximization of the AUC metric as the objective, and a fixed-size subset of the data was sampled for each experimental run. The hyperparameters $\gamma$, $\alpha$, and $w$ were tuned within the respective ranges (1, 3), (0.5, 1), and (0.01, 0.05), and in the MFL method they were set to 2.22, 0.61, and 0.049 through Bayesian hyperparameter optimization. To validate the effectiveness of the proposed method on the random forest, we compared it with ensemble methods such as voting and model averaging based on minimizing the cross-entropy loss, as well as commonly used methods such as the isolation forest and logistic regression. The dataset was divided into training and test sets in a 70:30 ratio, and the model was trained 60 times. The proposed method was evaluated using the AUC, F1-score, and recall.
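The three evaluation metrics can be computed without extra dependencies; a minimal sketch (the rank-based AUC ignores ties, and the function names are ours):

```python
import numpy as np

def auc(y, scores):
    """Rank-based AUC (Mann-Whitney statistic); assumes no tied scores."""
    ranks = np.argsort(np.argsort(scores)) + 1.0
    n_pos, n_neg = np.sum(y == 1), np.sum(y == 0)
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def recall(y, y_hat):
    """Fraction of true anomalies that are detected."""
    return np.sum((y == 1) & (y_hat == 1)) / np.sum(y == 1)

def f1(y, y_hat):
    """Harmonic mean of precision and recall."""
    tp = np.sum((y == 1) & (y_hat == 1))
    prec = tp / max(np.sum(y_hat == 1), 1)
    rec = tp / max(np.sum(y == 1), 1)
    return 2 * prec * rec / max(prec + rec, 1e-12)
```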
Fig 2 illustrates the performance of the model averaging anomaly detection method based on minimizing the MFL criterion on the test set, with average AUC, recall, and F1-score values of 0.8801, 0.7646, and 0.8598, respectively. On the network intrusion dataset, our proposed method achieved the best performance among the tested methods on all metrics, improving on the second-best values of AUC, recall, and F1-score.
Model averaging methods show no significant differences in AUC. The Mallows-like method performs well on the F1-score and recall, indicating effective detection of outliers.
To further validate the method’s effectiveness in anomaly detection, we selected nine imbalanced datasets from UCI, spanning various domains such as medicine, industrial production, agricultural production, and image classification, and utilized all their features. Simple random sampling was applied to control the ratio of positive to negative samples at 0.05. The datasets were split into training and test sets at a 70:30 ratio. We compared model averaging criteria that minimize different loss functions, along with commonly used outlier detection algorithms such as GMM, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and LOF [22, 23]. The proposed method was evaluated based on AUC, F1-score, and recall. We performed Bayesian hyperparameter tuning. Table 1 presents the results of hyperparameter optimization for the primary outcomes of the model.
Table 2 compares the AUC values of the proposed MFL method and other anomaly detection algorithms. Our model achieves up to an 18.31% improvement over the second-best model across different datasets, and a 9.50% improvement in the mean across nine datasets, which demonstrates its strong data fitting capability. Note that some methods have lost their capacity to effectively distinguish between classes due to the limitations of the methods employed and the severe imbalance in the data. Model averaging methods that use default parameter settings tend to achieve similar fitting performance, highlighting the importance of Bayesian hyperparameter tuning to enhance model performance. Table 3 compares the recall of the MFL method and other anomaly detection algorithms. Recall is the proportion of detected anomalies out of the total number of actual anomalies and hence is crucial in anomaly detection. Compared with commonly used model averaging methods and conventional anomaly detection techniques, our method achieves a 29.17% improvement in the mean recall and demonstrates a clear advantage in most scenarios.
Table 4 presents the F1-score value of our proposed method and the comparison models, which serves as a balanced measure of precision and recall. Our approach demonstrates superior performance across most scenarios, highlighting its robust predictive capability with imbalanced datasets.
In the above experiments, we employed simple random sampling to control the positive-to-negative ratio at 1:20. To further demonstrate the applicability of our method and its ability to classify extremely imbalanced data, we conducted experiments on the impact of the positive-to-negative ratio using the network intrusion dataset. Several points were selected, with positive-to-negative ratios ranging from 1:10 to 1:100. The evaluation was based on AUC and recall.
We first analyzed the impact of the positive-to-negative ratio on AUC and recall. Fig 3 shows the results of experiments conducted with ratios of 0.01, 0.02, 0.03, 0.05, 0.07, and 0.10. Model averaging methods generally outperform individual anomaly detection models. When the ratio is between 0.05 and 0.10, our proposed method shows a significant advantage in AUC and recall. For ratios of 0.02 and 0.03, all model averaging methods perform approximately the same. We speculate that the performance advantage is due to the tree structure of the base models. As the ratio further decreases, logistic regression fails to effectively distinguish between classes. Overall, our proposed method performs well under highly imbalanced conditions.
A: AUC; B: recall. The Mallows-like loss function demonstrates superior predictive performance across multiple scenarios, indicating significant advantages of model averaging methods based on this loss function.
3.2 Synthetic data experiment
In this experiment, we employed a data sampling method to control the proportion of sample points. However, such an approach may lead to distribution shifts in the data, thereby reducing the reliability of results. To address this issue, we utilized Adaptive Synthetic Sampling [24] to regenerate data points from the sampled data. Using the classic Spambase dataset as an example, the primary performance metrics of various models based on both original and augmented data are summarized in Table 5.
With the application of the ADASYN method, the performance of most models improves, indicating that ADASYN effectively synthesizes valuable data points. Notably, despite a reduction in the margin by which our model leads, it still achieves the best performance among the tested methods. This suggests that the improvements in our approach stem from innovations in the model architecture rather than from shifts in the data distribution, and that our method robustly handles such distributional shifts.
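The experiments use the ADASYN method cited in [24]; for illustration only, the interpolation at its core can be sketched as follows (simplified: the density-based allocation of synthetic counts, which distinguishes ADASYN from plain SMOTE-style oversampling, is omitted, and all names are ours):

```python
import numpy as np

def adasyn_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by interpolating randomly
    chosen minority points with one of their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise distances among minority points; k nearest neighbours, excluding self
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    base = rng.integers(0, len(X_min), size=n_new)             # anchor points
    nbr = nn[base, rng.integers(0, nn.shape[1], size=n_new)]   # one neighbour each
    lam = rng.random((n_new, 1))                               # interpolation factor
    return X_min[base] + lam * (X_min[nbr] - X_min[base])
```

Each synthetic point lies on the segment between a minority point and one of its neighbours, so the augmented sample stays inside the minority region.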
4 Conclusion and future work
We proposed a Mallows-like model averaging criterion for anomaly detection based on the focal loss function. This criterion was implemented in a random forest algorithm to address the occurrence of extremely imbalanced data. We compared our method with other ensemble models, as well as commonly used anomaly detection methods, on public benchmark datasets. The results indicated the superior performance of our method in terms of anomaly recall and classification accuracy.
In the future, a Mallows-like focal loss criterion under heteroscedasticity could be investigated, possibly replacing the variance term in the Mallows criterion with per-sample residuals. The performance of the proposed approach is constrained by the selection of the focal loss hyperparameters. A possible future research direction is the design of efficient hyperparameter tuning methods tailored to anomaly detection algorithms, together with theoretical support for asymptotic optimality in classification.
5 Supporting information
S1 Table. Hyperparameters of different datasets.
https://doi.org/10.1371/journal.pone.0323333.s001
(PDF)
S2 Table. AUC scores of anomaly detection algorithms.
https://doi.org/10.1371/journal.pone.0323333.s002
(PDF)
S3 Table. Recall scores of anomaly detection algorithms.
https://doi.org/10.1371/journal.pone.0323333.s003
(PDF)
S4 Table. F1-scores of anomaly detection algorithms.
https://doi.org/10.1371/journal.pone.0323333.s004
(PDF)
S5 Table. Experimental results after data augmentation using ADASYN.
https://doi.org/10.1371/journal.pone.0323333.s005
(PDF)
S1 Fig. Schematic diagram of proposed model averaging method.
https://doi.org/10.1371/journal.pone.0323333.s006
(TIF)
References
- 1. Feroze A, Daud A, Amjad T, Hayat MK. Group anomaly detection: Past notions, present insights, and future prospects. SN Comput Sci. 2021;2(3).
- 2. Fernandes G Jr, Rodrigues JJPC, Carvalho LF, Al-Muhtadi JF, Proença ML Jr. A comprehensive survey on network anomaly detection. Telecommun Syst. 2018;70(3):447–89.
- 3. Thudumu S, Branch P, Jin J, Singh J. A comprehensive survey of anomaly detection techniques for high dimensional big data. J Big Data. 2020;7(1).
- 4. Hilal W, Gadsden SA, Yawney J. Financial fraud: A review of anomaly detection techniques and recent advances. Expert Syst Applic. 2022;193:116429.
- 5. Tuor A, Kaplan S, Hutchinson B, Nichols N, Robinson S. Deep learning for unsupervised insider threat detection in structured cybersecurity data streams. In: Workshops at the 31st AAAI conference on artificial intelligence; 2017.
- 6. Xiang T, Zhang Y, Lu Y, Yuille AL, Zhang C, Cai W, et al. SQUID: Deep feature in-painting for unsupervised anomaly detection. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR). IEEE; 2023. p. 23890–901. https://doi.org/10.1109/cvpr52729.2023.02288
- 7. Suri N, Murty M, Athithan G. Outlier detection: Techniques and applications. Springer Nature; 2019. p. 3–11.
- 8. Mallows CL. Some comments on Cp. Technometrics. 2000;42(1):87–94.
- 9. Cheng W, Li X, Li X, Yan X. Model averaging for generalized linear models with missing at random covariates. Statistics. 2022;57(1):26–52.
- 10. Raftery AE, Madigan D, Hoeting JA. Bayesian model averaging for linear regression models. J Am Stat Assoc. 1997;92(437):179–91.
- 11. Hansen BE. Least squares model averaging. Econometrica. 2007;75(4):1175–89.
- 12. Willmott C, Matsuura K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim Res. 2005;30:79–82.
- 13. Ajide KB, Lanre Ibrahim R. Bayesian model averaging approach of the determinants of foreign direct investment in Africa. Int Econ. 2022;172:91–105.
- 14. Abdulganiyu OH, Tchakoucht TA, Saheed YK, Ahmed HA. Xidintfl-vae: Xgboost-based intrusion detection of imbalance network traffic via class-wise focal loss variational autoencoder. J Supercomput. 2025;81(1):1–38.
- 15. Saheed YK, Kehinde TO, Ayobami Raji M, Baba UA. Feature selection in intrusion detection systems: A new hybrid fusion of Bat algorithm and residue number system. J Inform Telecommun. 2023;8(2):189–207.
- 16. Abdulganiyu OH, Tchakoucht TA, Saheed YK. Towards an efficient model for network intrusion detection system (IDS): Systematic literature review. Wireless Netw. 2023;30(1):453–82.
- 17. Saheed YK, Omole AI, Sabit MO. GA-mADAM-IIoT: A new lightweight threats detection in the industrial IoT via genetic algorithm with attention mechanism and LSTM on multivariate time series sensor data. Sens Int. 2025;6:100297.
- 18. Lin TY, Goyal P, Girshick R. Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. IEEE; 2017. p. 2980–8.
- 19. Qiu Y, Xie T, Yu J. Mallows-type averaging machine learning techniques; 2023.
- 20. Snoek J, Larochelle H, Adams R. Practical Bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems. 2012;25.
- 21. Markovsky I, Van Huffel S. Overview of total least-squares methods. Signal Process. 2007;87(10):2283–302.
- 22. Ester M, Kriegel H, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining; 1996. p. 226–31.
- 23. Breunig M, Kriegel H, Ng R, Sander J. LOF: Identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data; 2000. p. 93–104.
- 24. He H, Bai Y, Garcia EA, Li S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE World Congress on computational intelligence). IEEE; 2008. p. 1322–8. https://doi.org/10.1109/ijcnn.2008.4633969