Abstract
Currently, deep learning models are widely used in many classification applications, but their utilization is limited by several factors. Large models can classify a wide range of labels, but they cannot be deployed on small devices. Small models can be deployed on small devices, but the number of labels they support is limited. To solve these problems, this paper proposes a classification method based on the Fusion of Multi-level Deep Learning Models (FM-DLM). We apply the Baidu-AI platform as the Level 0 model to classify a wide range of samples. Then, we use the differences among the Level 1 models to predict the dataset. Next, we use the Level 2 models trained on the predicted dataset to classify the label. Finally, we use the label distribution to achieve higher accuracy. The experimental results show that our method achieves higher accuracy than the existing methods while ensuring a wide range of classification.
Citation: Jin G, Li H, Du H, Song Q (2026) FM-DLM: A new method for image classification based on the fusion of multi-level deep learning models. PLoS One 21(1): e0338137. https://doi.org/10.1371/journal.pone.0338137
Editor: Claudionor Ribeiro da Silva, Universidade Federal de Uberlandia, BRAZIL
Received: March 12, 2025; Accepted: November 18, 2025; Published: January 27, 2026
Copyright: © 2026 Jin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Reference [23]: https://paperswithcode.com/sota/image-classification-on-cifar-10; reference [24]: https://paperswithcode.com/sota/image-classification-on-cifar-100; reference [25]: https://paperswithcode.com/dataset/mini-imagenet; reference [26]: https://paperswithcode.com/dataset/eurosat; reference [27]: https://www.kaggle.com/datasets/puneet6060/intel-image-classification; references [28,29]: https://www.kaggle.com/datasets/hojjatk/mnist-dataset. Code is shared at the following link: https://zenodo.org/records/16777230.
Funding: The fund of Beijing Polytechnic (2023X005-KXD). The fund of Beijing Polytechnic played a role in study design and data collection.
Competing interests: The authors have declared that no competing interests exist.
Introduction
With the development of deep learning technology, various models are constantly being designed [1,2]. The current trend is to design large models, which require more computing resources and lead to high training costs [3,4]. Therefore, some small devices may not be able to run these large models.
With the emergence of commercial large model platforms, we can perform the classification task by calling APIs (Application Programming Interfaces) [5,6]. Although these platforms can provide the classification service through the network, the training of new samples is not convenient, which causes low classification accuracy on these samples.
The fusion of deep learning models is another solution. To increase accuracy, some methods fuse the output of multiple models [7,8]. On the other hand, the accuracy depends on the output of each model. Therefore, the selection of high-precision models is the decisive factor. In addition, if these models are trained on different datasets, we should first predict the dataset.
To address these challenges, we try to efficiently utilize a large model platform and the fusion of deep learning models to achieve a wide range of classification and high accuracy. The contributions of this paper can be summarized as follows. The first one is that we efficiently match the result of the large model to the label of the datasets, which ensures a wide range of classification. The second is that our method optimizes the fusion methods, which achieves dataset prediction while ensuring high accuracy. The third one is that our method can be deployed to different types of small devices, which solves the problem of device dependency.
The structure of our paper is as follows. The first section is the introduction and the second one is related work. The third section introduces our method and the fourth one introduces the experiment. The fifth section summarizes this paper and discusses future work.
Related work
In this paper, we choose some existing deep learning models, a large model platform and some fusion methods as the baselines. The first type of deep learning model is based on the SNN (Spiking Neural Network) mechanism. The second type includes models with other kinds of structures. We introduce a large model platform that will be utilized at our Level 0. Then, we introduce some fusion methods that will be utilized at the other levels.
The first type of deep learning model is based on the SNN. The ANN-SNN model applies the quantization clip-floor-shift activation function to replace the ReLU (Rectified Linear Unit), which better approximates the activation function [9]. The hybrid training SNN utilizes a computationally efficient training technique [10]. The Low-Latency SNN model proposes a low-latency deep spiking network trained with gradient descent, which optimizes the membrane leak and the firing threshold [11]. The direct training SNN model proposes a neuron normalization technique to adjust neural selectivity and develops a direct learning algorithm for deep SNNs [12]. The TSSL-BP (Temporal Spike Sequence Learning Back Propagation) model uses a novel temporal spike sequence learning backpropagation method for training deep SNNs [13]. The TDBN (Threshold Dependent Batch Normalization) model enables direct training of a very deep SNN and the efficient implementation of its inference on hardware [14]. These models utilize different optimization techniques based on the structure of the SNN. Thus, their performance is limited by the structure of the SNN.
We introduce some different deep learning models to expand the range of model selection in our method. The TET (Temporal Efficient Training) model introduces a temporal efficient training approach to compensate for the loss of momentum in gradient descent [15]. The MPD (Membrane Potential Distribution) model attempts to rectify the membrane potential distribution by designing a novel distribution loss, which can explicitly penalize undesired shifts without introducing any additional operations in the inference phase [16]. The WRN (Wide Residual Networks) model proposes a ground radar target classification algorithm and an attention mechanism [17]. The DeiT (Data-efficient image Transformers) model produces a competitive convolution-free transformer by training only on ImageNet [18]. The Swin (Shifted Window) model presents a new vision transformer that capably serves as a general-purpose backbone for computer vision [19]. These models utilize different structures and parameter tuning techniques, which achieve high accuracy on various datasets. On the other hand, no single model achieves the highest accuracy on all datasets. Thus, we try to efficiently utilize these models to achieve high accuracy on multiple datasets.
To enable a wide range of classification, we select the Baidu-AI platform as the large model platform [20]. This platform can identify more than 100,000 kinds of objects and scenes, and provide corresponding API services to fully meet the application needs of various developers and enterprise users. The API request is used for general object and scene recognition; that is, for an input picture (which can be decoded normally and has an appropriate aspect ratio), multiple object and scene labels in the picture are output.
To efficiently use multiple models to achieve high accuracy, we choose some of the latest fusion methods. These methods can fuse the outputs of models on multiple datasets. Voting methods combine the top-performing models to achieve the high accuracy [21]. Weighted voting methods try to fuse various deep neural network models to achieve high accuracy [22].
We summarize these methods in Table 1. Basically, they can be summarized as three types: single models, large model platforms, and fusion methods. We introduce the advantages and limitations of these methods in this table.
Our method
In order to help understand the methods in this paper, we first provide some definitions.
Preliminaries
Firstly, we define $S_n$ as a sample and $G(S_n)$ as the ground truth of $S_n$. We name a dataset $D_j$. Then, we define a deep learning model $M_i$ that is trained on $D_j$ as $M_{i,j}$, and the output of model $M_{i,j}$ on $S_n$ as $F(M_{i,j}, S_n)$. Then, we can define the accuracy of $M_{i,j}$ on a sample set $\{S_n\}$ as

$$A(M_{i,j}, \{S_n\}) = \frac{\left|\{S_n \mid F(M_{i,j}, S_n) = G(S_n)\}\right|}{\left|\{S_n\}\right|}$$

On some samples, if the model with the highest classification accuracy is $M_{i,j}$, we define these samples to belong to $D_j$. We define $L_{k,j}$ as a label of $D_j$, and $N_j = |\{L_{k,j}\}|$ as the number of labels in $\{L_{k,j}\}$. Then, we construct the levels by the following equation:

$$N_0 > N_1 > N_2 \tag{1}$$

where $N_j$ represents the number of labels at Level $j$. Equation 1 represents the relationship between models at different levels: lower-level models have more labels than higher-level models. On the contrary, due to the targeted training on the corresponding datasets, we assume that the higher-level models achieve higher accuracy.
We can define the classification task in this paper as follows. When there are multiple datasets {Dj}, a classification process should first predict which Dy includes Sn. Then, on the predicted dataset Dy, it should classify the label Lx,y of Sn.
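As a minimal illustration, the accuracy definition above can be sketched in Python; the list-based inputs are hypothetical stand-ins for the model outputs $F(M_{i,j}, S_n)$ and the ground truths $G(S_n)$:

```python
def accuracy(predictions, ground_truth):
    """A(M_ij): the fraction of samples whose predicted label matches the ground truth."""
    assert len(predictions) == len(ground_truth)
    correct = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
    return correct / len(predictions)
```

For example, `accuracy(["cat", "dog"], ["cat", "cat"])` returns 0.5, since one of the two predictions matches its ground truth.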
Our framework
In this subsection, we illustrate our method (named FM-DLM) as shown in Fig 1. Step 1 is the selection of models, including the selection of a large model platform, based on whether they are shared and can be deployed to our devices. Step 2 is model training, which follows the training process outlined in the relevant papers. Step 3 is model selection, mainly selecting some trained models with higher accuracy on the corresponding dataset.
After these preparations, we can utilize our method to classify the samples. Step 4 uses the large model platform (named the Level 0 model) to classify the labels of samples. Then, our method filters the label from Level 0 to that of Level 1, which is used to tentatively predict the dataset. Step 5 predicts the dataset more precisely according to the differences among the Level 1 models. Step 6 selects the trained models (named Level 2 models) of the predicted dataset and fuses their outputs to classify the label. Step 7 uses the label distribution to optimize the results, which further increases the classification accuracy.
In practical applications, each device can choose the appropriate type and number of models based on storage and hard disk capacity. When we use the model platform at Level 0, we only need to ensure the internet connection. When we select the Level 1 models, we can choose the appropriate number of models according to memory capacity. Then, we can also run the models one by one in memory, which is suitable for small memory sizes. We summarize the levels as shown in Table 2.
Step 1: Select models
Firstly, we collect some models based on the following conditions. The first condition is that the models should be open-source and shareable. The second condition is that the memory consumption of the models is smaller than the memory of our GPU. The third condition is that we can deploy the models without errors.
We train these models following the training steps that are introduced in the related papers. Some models may have low performance on datasets that are not introduced in these papers. In other words, the tuning of these models is highly related to the corresponding datasets. Thus, the selection of trained models in Step 3 is important.
Step 2: Train models
If we use a commercial large model platform as the Level 0 model, we do not need model training at this level. These large model platforms can classify a wide range of labels.
When training the models at Level 1, we first prepare some existing models $\{M_i\}$ and public datasets $\{D_j\}$. We select each dataset $D_j$ and divide it into a training set $D_j^{train}$, a validation set $D_j^{val}$, and a testing set $D_j^{test}$. Then, each model $M_i$ is trained on $D_j^{train}$ to obtain a trained model $M_{i,j}$. When we train $M_i$ on another dataset $D_{j'}$, we obtain a trained model $M_{i,j'}$. Each model is trained on one dataset in our method.
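The split described above can be sketched as follows; the 70/10/20 proportions match those used later in the experiments, and `split_dataset` is a hypothetical helper, not the authors' code:

```python
import random

def split_dataset(samples, train=0.7, val=0.1, test=0.2, seed=0):
    """Shuffle a dataset D_j and split it into training, validation and testing sets."""
    assert abs(train + val + test - 1.0) < 1e-9
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```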
Step 3: Select trained models
The rule for selecting trained models is based on their accuracy on the validation sets. On each dataset, we choose the trained models $\{M_{x,j}\}$ with higher accuracy as follows:

$$\{M_{x,j}\} = \mathrm{Top}_i\, A\big(M_{i,j}, D_j^{val}\big) \tag{2}$$

where the accuracy $A(M_{i,j}, D_j^{val})$ is defined by the accuracy equation in the Preliminaries and computed on the validation set of $D_j$, and $\mathrm{Top}_i$ means we select the top $i$ models that achieved higher accuracy than the others on the validation set. The number $i$ is also determined by the validation set.
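The Top-i selection can be sketched as below, assuming each trained model's validation-set accuracy is already known; the model names in the usage example are illustrative only:

```python
def select_top_models(val_accuracy, i):
    """Keep the top-i trained models ranked by validation-set accuracy."""
    ranked = sorted(val_accuracy, key=val_accuracy.get, reverse=True)
    return ranked[:i]
```

For instance, with validation accuracies `{"TET": 0.95, "ANN-SNN": 0.93, "WRN": 0.91}` and `i = 2`, the helper keeps TET and ANN-SNN.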
Step 4: Filter
We can classify the label of a sample with the Level 0 model. If the classified label belongs to the collected datasets $\{D_j\}$, we continue to the next process. Otherwise, the classified label is the final result. The large model platform has a relatively wide distribution of labels, but the label format follows its own rules. Therefore, there are some differences between the labels provided by the large model platform and those of $\{D_j\}$. Thus, we should map the labels of Level 0 to those of Level 1. We can define the mapping as follows:

$$\mathrm{Map}(L_{m,0}) = \arg\max_{L_{k,j}} P\big(G(S_n) = L_{k,j} \mid L_{m,0}\big) \tag{3}$$

where $L_{m,0}$ is a label of Level 0 and $L_{k,j}$ is a label of Level 1 as defined in the Preliminaries, and $P(G(S_n) = L_{k,j} \mid L_{m,0})$ is the probability that the ground truth $G(S_n)$ is the label $L_{k,j}$ when Level 0 outputs $L_{m,0}$.
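A sketch of the Level 0 to Level 1 mapping is given below. The mapping table is a hypothetical structure holding, for each Level 0 label, the estimated probabilities of the Level 1 labels; returning `None` stands for the case where the Level 0 label lies outside {D_j} and is therefore the final result:

```python
def map_level0_label(level0_label, mapping_table):
    """Map a platform (Level 0) label to the most probable Level 1 label."""
    candidates = mapping_table.get(level0_label)
    if not candidates:
        return None  # outside the collected datasets: keep the Level 0 result
    # choose the Level 1 label with the highest estimated probability
    return max(candidates, key=candidates.get)
```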
Step 5: Dataset prediction
When Step 4 indicates that $S_n$ belongs to one of the collected datasets, we should predict which dataset contains this sample. We use the differences among the Level 1 models to classify the dataset that may include this sample.
We define the probability of a label $L_{k,j}$ on a sample $S_n$ given by the output of $M_{i,j}$ as $P(M_{i,j}, S_n, L_{k,j})$. On the validation set, we can get the weights $\{W_{i,j}\}$ as shown in the following equation:

$$W_{i,j} = A\big(M_{i,j}, D_j^{val}\big) \tag{4}$$

where $A(M_{i,j}, D_j^{val})$ is the accuracy of the trained model $M_{i,j}$ on the validation set of $D_j$. We define the difference between $M_{i,j}$ and $M_{ii,j}$ on a sample $S_n$ as follows:

$$\mathrm{Diff}(M_{i,j}, M_{ii,j}, S_n) = \sum_{k} \big| W_{i,j}\, P(M_{i,j}, S_n, L_{k,j}) - W_{ii,j}\, P(M_{ii,j}, S_n, L_{k,j}) \big| \tag{5}$$

where $P(M_{i,j}, S_n, L_{k,j})$, $W_{i,j}$ and $W_{ii,j}$ are defined by Equation 4. $W_{i,j}$ is the weight related to model $M_{i,j}$ and $W_{ii,j}$ is the weight related to model $M_{ii,j}$. We select $M_{ii,j}$ as the model that achieves the highest accuracy on $D_j^{val}$. Then, we can predict the dataset that may include $S_n$ by the following equation:

$$D_y = \arg\min_{D_j} \mathrm{Diff}(M_{i,j}, M_{ii,j}, S_n) \tag{6}$$

where we define $D_y$ as the predicted dataset that may contain the sample $S_n$.
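The dataset-prediction step can be sketched as follows, assuming an L1-style weighted difference between the two selected Level 1 models of each dataset; the tuple layout of `per_dataset_outputs` is an illustrative choice, not the published interface:

```python
def pairwise_difference(p_a, p_b, w_a, w_b):
    """Weighted probability gap between two models on one sample (Step 5)."""
    return sum(abs(w_a * pa - w_b * pb) for pa, pb in zip(p_a, p_b))

def predict_dataset(per_dataset_outputs):
    """Pick the dataset whose model pair disagrees the least on the sample.

    per_dataset_outputs maps a dataset name to (p_a, p_b, w_a, w_b), where
    p_a and p_b are the label-probability vectors of the two models.
    """
    return min(per_dataset_outputs,
               key=lambda d: pairwise_difference(*per_dataset_outputs[d]))
```

The intuition matches the analysis later in the paper: models trained on the sample's true dataset output similar probabilities, so the smallest weighted gap identifies that dataset.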
Step 6: Label classification
After $D_y$ is obtained through Equation 6, we can classify the label by the following equation:

$$L_{x,y} = \arg\max_{L_{k,y}} \sum_{i} W_{i,y}\, P(M_{i,y}, S_n, L_{k,y}) \tag{7}$$

where $W_{i,y}$ and $P(M_{i,y}, S_n, L_{k,y})$ are introduced in Equation 4. $P(M_{i,y}, S_n, L_{k,y})$ is the output of the trained models of $D_y$.
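The weighted fusion of the Level 2 outputs can be sketched as below; the equal weights and label names in the usage example are illustrative:

```python
def classify_label(model_probs, weights, labels):
    """Fuse the label probabilities of the Level 2 models by a weighted sum (Step 6)."""
    scores = {
        label: sum(w * probs[k] for w, probs in zip(weights, model_probs))
        for k, label in enumerate(labels)
    }
    return max(scores, key=scores.get), scores
```

With two models outputting [0.6, 0.4] and [0.7, 0.3] and equal weights of 0.5, the fused score of the first label is 0.65 versus 0.35, so the first label is selected.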
Step 7: Label distribution
Generally, a validation set is used to simulate the corresponding testing set. Therefore, we can assume that the label distribution of the validation set is the same as that of the testing set. We compute the distribution of label $L_{k,j}$ on the validation set $D_j^{val}$ using the following equation:

$$Q(L_{k,j}) = \frac{N^{val}(L_{k,j})}{N^{val}} \tag{8}$$

where $N^{val}(L_{k,j})$ is the number of samples (that belong to $D_j^{val}$) for which the ground truth is $L_{k,j}$, and $N^{val}$ is the number of all samples that belong to $D_j^{val}$. After the labels are classified by Equation 7, the scores of some results may be low. We can set thresholds based on the validation set to select these low-score results. For these results, we further increase the accuracy using the following equation:

$$L_{x,y} = \arg\max_{L_{k,y}} \Big[ \sum_{i} W_{i,y}\, P(M_{i,y}, S_n, L_{k,y}) + \lambda\, Q(L_{k,y}) \Big] \tag{9}$$

where $P(M_{i,y}, S_n, L_{k,y})$ is defined by Equation 4 and $Q(L_{k,y})$ is defined by Equation 8. We also compute the hyper-parameter $\lambda$ on the validation set.
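The label-distribution correction can be sketched as follows, assuming an additive combination of the fused score and a lambda-weighted validation-set prior; the additive form and the helper names are assumptions for illustration, not the published implementation:

```python
def label_distribution(val_labels):
    """Empirical label frequencies on the validation set (Step 7 prior)."""
    total = len(val_labels)
    counts = {}
    for label in val_labels:
        counts[label] = counts.get(label, 0) + 1
    return {label: c / total for label, c in counts.items()}

def rescore(fused_scores, prior, lam):
    """Re-rank low-score results by adding the lambda-weighted label prior."""
    return max(fused_scores,
               key=lambda label: fused_scores[label] + lam * prior.get(label, 0.0))
```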
Pseudo code of our method
We introduce the pseudo code related to our method in Table 3.
Experiment
Experimental setup
We selected three public datasets: the CIFAR-10 dataset [23], the CIFAR-100 dataset [24], and the Mini-ImageNet dataset [25]. Generally, for each dataset, we use 70% of the samples for training, 10% for validation, and 20% for testing.
We select some shared models, which are ANN-SNN [9], Hybrid training SNN [10], Low-latency SNN [11], Direct training SNN [12], TSSL-BP [13], TDBN [14], TET [15], MPD [16], WRN [17], DeiT [18], Swin (Shifted Window) [19]. The selection depends on the possibility of implementation on our device. We select Baidu-AI large model platform as the Level 0 model. Table 4 shows the details of classification by Baidu-AI.
For better comprehension, we use Table 5 to explain the evaluation metrics.
The evaluation of trained models (Step 3)
Table 6 shows the classification accuracy of different models on the three public datasets. Due to their different optimization details, the accuracy of these models differs across datasets. We build our framework based on the selection of these models at Step 3.
The evaluation of dataset prediction (Step 5)
Table 7 shows the accuracy of the dataset prediction at Step 5. Compared with the existing methods, our method is 3.5% higher on CIFAR-10, 2.47% higher on CIFAR-100, and 3.96% higher on Mini-ImageNet.
The evaluation of label classification with dataset prediction (Step 6)
Table 8 shows the accuracy of the label classification with dataset prediction at Step 6. Compared with the existing methods, our method is 4.01% higher on CIFAR-10, 3.01% higher on CIFAR-100, and 5.03% higher on Mini-ImageNet.
The evaluation of label distribution (Step 7)
In the label distribution step, we assigned a random label distribution to the samples. Table 9 shows the accuracy of the label distribution. Compared with our method at Step 6, our method with label distribution (Step 7) is 1.83% higher on CIFAR-10, 2.1% higher on CIFAR-100, and 1.5% higher on Mini-ImageNet.
The evaluation of ablation
Table 10 shows the ablation experimental results of the steps. From Step 1 to Step 4, the best performance is achieved by Baidu-AI. After we add dataset prediction at Step 5, we can use the corresponding models to classify the labels, which allows our method to achieve the best performance. When we fuse the outputs of the models to classify labels at Step 6, our method also achieves the best performance. Furthermore, when we optimize the results using the label distribution at Step 7, our method achieves higher accuracy than that of Step 6.
The evaluation of the model selection
Fig 2 shows how the selection of models affects the classification accuracy on Mini-ImageNet. In this figure, the blue column shows the methods with the worst 5 trained models (models that achieve lower accuracy than the others) on each dataset. The orange column shows the methods with the best trained models (models that achieve the highest accuracy on the corresponding dataset). The green column shows the methods with a random selection of trained models (we randomize both the number of models and the selection of these models), for which we compute the average accuracy over 100 runs. As this figure shows, the selection of models plays an important role in the accuracy. Our method achieves the highest accuracy among all of these selections.
The voting method does not consider the importance of high-precision models, which reduces accuracy. The weighted voting method solves this problem, but it only uses the classified label, which is the final output of each model. In contrast, our method fully utilizes the probability of each label before the final output. Therefore, our method can achieve higher accuracy than other methods.
The experiments on more datasets and models
Fig 3 shows our method on 6 datasets. In addition to the 3 datasets collected above, we further collected the EuroSAT dataset [26], the Intel-image-classification dataset (named Intel) [27] and MNIST [28,29]. Furthermore, we adopted related models for MNIST, which are F2PQNN [28] and NoRD [29]. These models achieve high accuracy on MNIST (F2PQNN reaches 99.09% and NoRD reaches 96.74%), so we adopt them in our method. The accuracy on 6 datasets is lower than that on 3 datasets. The variety of samples increases as there are more datasets, which reduces the accuracy of dataset prediction. Thus, label classification by the Level 0 model plays an important role in reducing the difficulty of dataset prediction at Level 1. Compared with the weighted voting method, our method achieved higher accuracy on each dataset.
The analysis
Fig 4 shows a simple illustration of our method. As this figure shows, when a sample $S_n$ belongs to $D_1$, the corresponding trained models $\{M_{i,1}\}$ tend to output similar probabilities for the labels. Furthermore, the probability of the ground truth will be higher than those of the other labels. On the other hand, as the trained models $\{M_{i,0}\}$ on $D_0$ cannot effectively capture the features of this sample, their outputs differ from one another.
Hardware efficiency metrics
The CPU and GPU applied in our experiment are shown in Table 11. We select Mini-ImageNet as an example dataset. We record the maximum execution time, maximum memory consumption and FLOPs (G) of a single model (we record the maximum value among all models), the existing methods and our method, as shown in Table 12. Our method runs multiple models on each dataset, which makes the runtime larger than that of a single model. Furthermore, the connection with Baidu-AI consumes more time than the computation on either the CPU or the GPU.
Compared with the execution time of a single model (we record the model with the maximum execution time), the existing fusion methods run multiple models on the GPU, which leads to longer execution time. Furthermore, these fusion methods fuse the outputs of multiple models on the CPU side, which leads to additional execution time. Our method outputs the probabilities of the labels on the GPU side and computes the final results on the CPU side, which leads to longer execution time than those of the existing methods.
The existing methods and ours run the models one by one on the GPU side. Thus, the maximum memory consumption of the existing methods is the same as that of a single model. Our method needs to store the probabilities output by the models, which leads to additional memory consumption.
The execution time for connecting to Baidu-AI is the same for single models and fusion methods. Without classification by Baidu-AI, none of the methods can perform wide-range classification.
The employed acronyms
We use Table 13 to introduce the employed acronyms in this paper.
Conclusions
This paper proposes a new way to efficiently utilize different levels of deep learning models to achieve high classification accuracy while ensuring wide-range classification. Our method solves the matching problem between large model platforms and deep learning models. Furthermore, we improve the accuracy of dataset prediction and label classification beyond that of the existing fusion methods. Our method can be deployed on small devices, which is important for many applications.
In future work, we will conduct more experiments to study how the diversity of samples affects the performance of trained models, aiming to further increase classification accuracy. Furthermore, as the same wrong results lower the performance of the fusion methods, the similarity of trained models will also be a focus for future research.
References
- 1. Devi SN, Natarajan R, Gururaj HL, Flammini F, Sulaiman Alfurhood B, Krishna S. Ridge Regressive Data Preprocessed Quantum Deep Belief Neural Network for Effective Trajectory Planning in Autonomous Vehicles. Complexity. 2024;2024(1):1–13.
- 2. Karaköse E. An Efficient Satellite Images Classification Approach Based on Fuzzy Cognitive Map Integration With Deep Learning Models Using Improved Loss Function. IEEE Access. 2024;12:141361–79.
- 3. Pietroń M, Żurek D, Śnieżyński B. Speedup deep learning models on GPU by taking advantage of efficient unstructured pruning and bit-width reduction. J Comput Sci. 2023;67:101971.
- 4. Yao F, Zhang Z, Ji Z, Liu B, Gao H. LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster. J Supercomput. 2024;80(9):12247–72.
- 5. Chai X, Zhang M, Tian H. AI for Science: Practice from Baidu Paddle. In: 2024 Portland International Conference on Management of Engineering and Technology (PICMET), Portland, OR, USA, 2024. p. 1–12.
- 6. Jones N. How should we test AI for human-level intelligence? OpenAI’s o3 electrifies quest. Nature. 2025;637(8047):774–5. pmid:39805930
- 7. Thangavel K, Palanisamy N, Muthusamy S, Mishra OP, Sundararajan SCM, Panchal H, et al. A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models. Soft Comput. 2023;27(19):14205–18.
- 8. Wang S, Ni L, Zhang Z, Li X, Zheng X, Liu J. Multimodal prediction of student performance: A fusion of signed graph neural networks and large language models. Pattern Recogn Lett. 2024;181:1–8.
- 9. Jiang C, Zhang Y. A Noise-Based Novel Strategy for Faster SNN Training. Neural Comput. 2023;35(9):1593–608. pmid:37437192
- 10. He X, Li Y, Zhao D, Kong Q, Zeng Y. MSAT: biologically inspired multistage adaptive threshold for conversion of spiking neural networks. Neural Comput Applic. 2024;36(15):8531–47.
- 11. Rathi N, Roy K. DIET-SNN: A Low-Latency Spiking Neural Network With Direct Input Encoding and Leakage and Threshold Optimization. IEEE Trans Neural Netw Learn Syst. 2023;34(6):3174–82. pmid:34596559
- 12. Wu Y, Deng L, Li G, Zhu J, Xie Y, Shi L. Direct training for spiking neural networks: Faster, larger, better. In: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33. 2019. p. 1311–8. https://doi.org/10.1609/aaai.v33i01.3301131
- 13. Zhang W, Li P. Temporal spike sequence learning via backpropagation for deep spiking neural networks. Adv Neural Inf Process Syst. 2020;33:12022–33.
- 14. Zheng H, Wu Y, Deng L, Hu Y, Li G. Going Deeper With Directly-Trained Larger Spiking Neural Networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35. 2021. p. 11062–70.
- 15. Deng S, Li Y, Zhang S, Gu S. Temporal efficient training of spiking neural network via gradient re-weighting. In: International Conference on Learning Representations. 2021.
- 16. Guo Y, Tong X, Chen Y, Zhang L, Liu X, Ma Z, et al. RecDis-SNN: Rectifying Membrane Potential Distribution for Directly Training Spiking Neural Networks. In: Conference on Computer Vision and Pattern Recognition (CVPR), 2022. p. 326–35.
- 17. Alsekait D, Zakariah M, Amin SU, Khan ZI, Alqurni JS. Privacy preservation in iot devices by detecting obfuscated malware using wide residual network. Comput Mater Cont. 2024;81(11):2395–436.
- 18. Yadav RK, Daniel A, Semwal VB. Enhancing Human Activity Detection and Classification Using Fine Tuned Attention-Based Transformer Models. SN Comput Sci. 2024;5(8):1–21.
- 19. Yao D, Shao Y. A data efficient transformer based on Swin Transformer. Vis Comput. 2023;40(4):2589–98.
- 20. Available from: https://ai.baidu.com/tech/imagerecognition
- 21. Aurangzeb S, Aleem M. Evaluation and classification of obfuscated Android malware through deep learning using ensemble voting mechanism. Sci Rep. 2023;13(1):3093. pmid:36813846
- 22. Açıkkar M, Tokgöz S. An improved KNN classifier based on a novel weighted voting function and adaptive k-value selection. Neural Comput Appl. 2023;36(8):4027–45.
- 23. Kundroo M, Kim T. Demystifying Impact of Key Hyper-Parameters in Federated Learning: A Case Study on CIFAR-10 and FashionMNIST. IEEE Access. 2024;12:120570–83.
- 24. Huang Y, Zhu Y-H, Zhigao Z, Ou Y, Kong L. Classification of Long-Tailed Data Based on Bilateral-Branch Generative Network with Time-Supervised Strategy. Complexity. 2021;2021(1):1–10.
- 25. Bhakta S, Nandi U, Changdar C, Ghosal SK, Pal RK. emapDiffP: A novel learning algorithm for convolutional neural network optimization. Neural Comput Appl. 2024;36(20):11987–2010.
- 26. Günen MA. Performance comparison of deep learning and machine learning methods in determining wetland water areas using EuroSAT dataset. Environ Sci Pollut Res Int. 2022;29(14):21092–106. pmid:34746985
- 27. Available from: https://www.kaggle.com/datasets/puneet6060/intel-image-classification
- 28. Li J, Yuan P, Zhang J, Shen S, He Y, Xiao R. F2PQNN: a fast and secure two-party inference on quantized convolutional neural networks. Comput J. 2025;68(8):998–1012.
- 29. Sharma S, Lodhi SS, Srivastava V, Chandra J. NoRD: A framework for noise-resilient self-distillation through relative supervision. Appl Intell. 2025;55(7).