
Knowledge distillation based on multi-layer fusion features

  • Shengyuan Tan ,

    Roles Methodology, Writing – original draft, Writing – review & editing

    tttan_sy@163.com

    Affiliation College of Computer Science, Sichuan Normal University, Chengdu, Sichuan, 610101, China

  • Rongzuo Guo,

    Roles Conceptualization

    Affiliation College of Computer Science, Sichuan Normal University, Chengdu, Sichuan, 610101, China

  • Jialiang Tang,

    Roles Methodology, Writing – review & editing

    Affiliation School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan, 621010, China

  • Ning Jiang,

    Roles Funding acquisition, Methodology

    Affiliation School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan, 621010, China

  • Junying Zou

    Roles Funding acquisition

    Affiliation College of Computer Science, Sichuan Normal University, Chengdu, Sichuan, 610101, China

Abstract

Knowledge distillation improves the performance of a small student network by encouraging it to learn knowledge from a pre-trained, high-performance but bulky teacher network. Most current knowledge distillation methods extract relatively simple features from the middle or bottom layers of the teacher network for knowledge transfer. However, these methods ignore feature fusion, even though fused features contain richer information. We believe that the richer the information contained in the knowledge the teacher delivers, the easier it is for the student to perform well. In this paper, we propose a new method called Multi-feature Fusion Knowledge Distillation (MFKD) to extract and utilize expressive fusion features of the teacher network. Specifically, we extract feature maps from different positions in the network, i.e., the middle layers, the bottom layers, and even the front layers. To properly utilize these features, we design a multi-feature fusion scheme to integrate them. Compared with features extracted from a single location of the teacher network, the final fusion feature map contains more meaningful information. Extensive experiments on image classification tasks demonstrate that a student network trained with MFKD can learn from the fusion features and achieve superior performance. The results show that MFKD improves the Top-1 accuracy of ResNet20 and VGG8 by 1.82% and 3.35%, respectively, on the CIFAR-100 dataset, outperforming many existing state-of-the-art methods.

Introduction

The great success of computer vision over the past few decades is inseparable from deep neural networks (DNNs) [1–4], which show excellent performance on many vision tasks [5,6]. Generally speaking, the performance of a network model is positively related to its number of parameters and its computational cost. However, a model with a large number of parameters cannot be deployed on embedded devices with limited resources. The existing methods for this problem mainly include knowledge distillation (KD) [7,8], network pruning [9,10], network quantization [11,12] and low-rank factorization [13], among which KD is particularly effective.

The essence of KD is to enable a small student network with few parameters to learn the “dark knowledge” taught by a large teacher network with many parameters; Fig 1 shows the essence of KD. The student network can thereby achieve considerable performance improvement, even approaching the performance of the teacher network. In existing KD methods, there are two main ways to transfer “dark knowledge”: logits distillation [7,14] and feature distillation [15,16]. Logits distillation performs knowledge transfer by minimizing the relative entropy between the logits predicted by the teacher and the student. Compared with logits distillation, feature-based methods perform well on many tasks, such as model compression of image classification networks and of object detection networks [17,18], so researchers have favored feature distillation in recent years. However, most of these methods ignore the fact that, within the same neural network, the feature maps generated by layers at different locations contain different information, which may yield different results when applied to KD.
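For concreteness, the logits-distillation objective of Hinton et al. [7] can be sketched in a few lines of numpy (a minimal illustration; the helper names and the temperature value are our own choices, not the paper's):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; a higher T yields a softer distribution.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, T=4.0):
    # KL divergence between the softened teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as their softened distributions diverge.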

Fig 1. Schematic diagram of the essence of knowledge distillation (KD).

https://doi.org/10.1371/journal.pone.0285901.g001

It is well known that, as a neural network computes layer by layer, downsampling progressively reduces the resolution of its feature maps. Low-resolution feature maps carry stronger semantic information, while high-resolution feature maps provide more accurate localization activations because they have undergone fewer downsampling steps [19]. Thus, we believe that extracting feature maps from different layers of the network, fusing them, and then distilling can raise the quality of the knowledge the teacher passes to the student, thereby improving the student's learning.

In the present work, a new method called Multi-feature Fusion Knowledge Distillation (MFKD) is proposed. First, we extract the feature maps generated by different layers of the teacher network and employ a feature pyramid to fuse them; during this process, the feature maps are corrected. We then obtain the fusion feature of the student network in the same way. Finally, the mean squared error (MSE) between the fusion feature maps produced by the teacher and the student is minimized to promote knowledge transfer.

In summary, our main contributions are as follows:

  • Considering that fused features are rich in information, we perform knowledge transfer by fusing feature maps generated by multiple neural network layers.
  • During fusion, an attention mechanism is added to correct the feature maps so as to obtain better multi-layer fusion features.
  • Across different datasets and network architectures, MFKD significantly improves the distillation effect.

Related work

Since this paper focuses on compressing neural network models via KD, recent related work on model compression and KD is reviewed below.

a) Model Compression.

To solve the problem of deploying DNN models on embedded devices with limited storage and computing resources, model compression technology has been proposed. Several methods realize model compression: network pruning, network quantization, low-rank factorization and KD. Neural network models contain a large number of redundant parameters that have only a subtle impact on the final result; the main idea of pruning is therefore to cut out the unimportant neurons and filters in the network to compress the model [20,21]. The purpose of quantization is to replace the high-precision numbers stored in the original model with low-precision numbers [22]. Low-rank factorization sparsifies the convolution kernel matrix by combining dimensions and imposing low-rank constraints [23]: most of the weight vectors lie in a low-rank subspace, so a small number of basis vectors suffice to reconstruct the convolution kernel matrix, reducing storage space.

b) Knowledge Distillation.

The framework of knowledge distillation was first proposed by Hinton et al. [7] and is based on logits. By introducing the concept of temperature (T), the logits generated by the teacher model are softened into soft targets, and the logits generated by the student are then trained to mimic these soft targets. However, most current knowledge distillation methods are based on features. According to the extraction location of the features, we make the following divisions:

(b1) Middle layer.

Romero et al. [15] proposed a two-stage method. In the first stage, it extracts the intermediate-layer features of the teacher and student networks and lets the student features fit the teacher features, yielding a pre-trained student model. In the second stage, starting from the parameters obtained in the first stage, the authors train with soft targets to obtain the complete student model parameters.

(b2) Bottom layer.

Factor Transfer (FT) [24] distills at the end of the last convolutional layer group, where a convolutional layer group refers to a set of convolution layers whose output feature maps share the same spatial size. For example, ResNet56 has three layer groups with nine residual blocks in each group. The method of Heo et al. [25] is slightly different from FT: they extract features at the position between the first ReLU and the end of the layer group, which better preserves the information conveyed by the teacher. Contrastive Representation Distillation (CRD) [17] adds contrastive learning to knowledge distillation and transfers knowledge by comparing the penultimate-layer features (before the logits) of the teacher and student networks.

(b3) Multilayer.

Unlike FitNets, which extracts the output features of an arbitrary intermediate layer, the AT [26], FSP [27] and Jacobian [28] methods extract the output features of each convolutional layer group, even when the teacher and student networks have different depths. Chen et al. [16] used multi-layer features of the teacher network to guide the learning of a particular layer in the student network, realizing knowledge transfer across different layers.

Unlike the aforementioned methods, we not only extract feature maps from different layers of the neural network but also fuse them to obtain fusion feature maps that contain abundant information and have strong representation ability. This makes it easier for the student network to learn and improves the effect of KD.

Our method

Multi-feature Fusion Knowledge Distillation (MFKD) is a feature-based distillation method, so feature extraction and processing are particularly important. In this section, we introduce the details of MFKD, including notation, feature extraction and correction, the feature fusion pyramid, and the hyperparameter p.

• Notation.

Suppose that the teacher network and the student network are represented by Nt and Ns respectively, the convolutional part of Nt has i layer groups, and that of Ns has j layer groups. The input is represented by x. When x is processed by the network, the set of output feature maps produced by the layer groups of Nt can be expressed as O_t = {o_t^1, o_t^2, …, o_t^i}, in which o_t^k is the output of the k-th layer group of Nt and is also the input of the (k+1)-th layer group. Similarly, the set of output feature maps produced by the layer groups of Ns can be expressed as O_s = {o_s^1, o_s^2, …, o_s^j}. Among them, o_t^i and o_s^j are the final outputs of the entire convolutional part of Nt and Ns, respectively. We select m feature maps from O_t to form the feature extraction set of Nt, F_t = {f_t^1, f_t^2, …, f_t^m}, in which the resolution of f_t^k is higher than that of f_t^{k+1}. Likewise, F_s represents the feature extraction set of Ns with n feature maps. The final fusion features of Nt and Ns are denoted by U_t and U_s, respectively.

Therefore, the knowledge transfer between the teacher and the student can be described as an optimization problem with the following expression:

L_transfer = MSE(U_t, U_s),   (1)

where U_t and U_s are the final fusion features of the teacher and the student.
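In numpy, this objective amounts to a plain mean squared error over the aligned fusion maps (a sketch; `mse_transfer_loss` is a hypothetical helper name):

```python
import numpy as np

def mse_transfer_loss(fusion_t, fusion_s):
    # MSE between the teacher's and student's final fusion feature maps,
    # assumed to share the same C x H x W shape after alignment.
    assert fusion_t.shape == fusion_s.shape
    return float(np.mean((fusion_t - fusion_s) ** 2))
```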

• Features extraction and correction.

We extract features at different positions in the teacher network and the student network to obtain the feature extraction sets of the two networks, and use an attention mechanism to correct all feature maps in these sets. The Squeeze-and-Excitation block (SEblock) [29] is used in this paper to achieve feature correction, and its framework is displayed in Fig 2.

Fig 2. The framework of the Squeeze-and-Excitation block.

https://doi.org/10.1371/journal.pone.0285901.g002

As shown in Fig 2, SEblock consists of two parts: squeeze and excitation. In the squeeze phase, global average pooling converts the input feature map f of size C×H×W into an output feature map f_squ of size C×1×1, compressing the information in f into f_squ. In the excitation phase, the main idea is a simple gating mechanism with a sigmoid activation, parameterized by two fully connected layers with a dimensionality-reduction ratio r (in this paper, r = 16): f_exc = σ(W_2 · ReLU(W_1 · f_squ)). Finally, channel-wise multiplication of f and f_exc completes the mapping from f to f′:

f′ = f_exc ⊙ f.   (2)
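A minimal numpy sketch of this squeeze-and-excitation computation (the matrices w1 and w2 stand in for the two fully connected layers; their shapes follow the reduction ratio r, and any concrete values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(f, w1, w2):
    # f: feature map of shape (C, H, W); w1: (C//r, C); w2: (C, C//r).
    f_squ = f.mean(axis=(1, 2))                        # squeeze: global average pooling -> (C,)
    f_exc = sigmoid(w2 @ np.maximum(w1 @ f_squ, 0.0))  # excitation: FC -> ReLU -> FC -> sigmoid
    return f * f_exc[:, None, None]                    # channel-wise rescaling of f
```

Because the gate values lie in (0, 1), each channel of the output is a damped copy of the corresponding input channel.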

• Feature fusion pyramid.

Since the feature maps generated at different locations of the network have different sizes, it is impossible to fuse them directly. Therefore, we use the feature fusion pyramid to fuse the features. The feature fusion pyramid structure is based on the pyramid feature hierarchy of convolutional neural networks [1,2], with the purpose of fusing high-level semantic information and low-level localization features in the neural network and performing knowledge transfer.

Taking Nt as an example, the framework of the feature fusion pyramid is shown in Fig 3. We first correct f_t^m and f_t^{m−1} with the help of the Squeeze-and-Excitation block to obtain f′_t^m and f′_t^{m−1}, and then fuse these two corrected feature maps:

u_1 = Fuse(f′_t^m, f′_t^{m−1}).   (3)

Since the resolution of f′_t^{m−1} is higher than that of f′_t^m, it is necessary to downsample f′_t^{m−1} before fusion. If element-wise addition is used as the fusion method (see the ablation study in the experiment section), f′_t^{m−1} and f′_t^m should also have the same number of channels; this paper uses 1×1 convolutional layers to increase or decrease the channel dimension of feature maps. It should be noted that whenever a feature map is processed by downsampling, dimensionality change, or fusion, it is corrected again. We found that feature correction keeps the information carried by the feature map accurate, and the final fusion features perform well after this repeated adjustment.

In fact, a feature map changes after convolution, and the original information it carries is also affected. Therefore, we change the channel dimension of f′_t^{m−1} after downsampling so that it matches that of f′_t^m. The biggest advantage of this scheme is that the original dimension of f′_t^m is unchanged, which preserves the originality of the information in f′_t^m.

We next correct u_1 and f_t^{m−2} to get u′_1 and f′_t^{m−2}, and then fuse them; as before, both downsampling and channel up/down projection are applied to f′_t^{m−2}. Repeating this procedure, after m−1 fusions we obtain the final fusion feature of Nt:

U_t = u_{m−1},  where u_k = Fuse(u′_{k−1}, f′_t^{m−k}) for k = 2, …, m−1.   (4)

In the same way, the final fusion feature U_s of Ns can be obtained. For MFKD, knowledge transfer is realized by letting U_s simulate U_t, shortening the distance between them; this is the problem described in Eq 1.
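The fusion procedure above can be sketched in numpy (an illustration under simplifying assumptions: average pooling stands in for downsampling, a channel-projection matrix stands in for the 1×1 convolution, and the repeated attention correction is omitted):

```python
import numpy as np

def avg_pool2x(f):
    # Halve the spatial resolution of a (C, H, W) map by average pooling.
    C, H, W = f.shape
    return f.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def conv1x1(f, w):
    # A 1x1 convolution is a per-pixel channel projection: w is (C_out, C_in).
    return np.einsum('oc,chw->ohw', w, f)

def fuse_pyramid(feats, projs):
    # feats: corrected maps ordered from highest to lowest resolution;
    # projs: one channel projection per higher-resolution map, matching
    # the channel count of the last (lowest-resolution) map.
    u = feats[-1]
    for f, w in zip(reversed(feats[:-1]), reversed(projs)):
        while f.shape[1] > u.shape[1]:
            f = avg_pool2x(f)      # downsample the higher-resolution map
        u = u + conv1x1(f, w)      # element-wise addition after projection
    return u                       # final fusion feature after m-1 fusions
```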

• Hyperparameter p.

To further improve the performance of MFKD, a hyperparameter p is introduced. In the process of Ns learning from Nt, when the prediction results of Nt are good, Ns should learn from Nt; when the prediction results of Nt are bad, Ns should instead learn from the ground-truth labels. In this way, p becomes the criterion for judging the quality of Nt's predictions, and the remaining task is to set its optimal value. We divide a dataset into α batches and send them to Nt to compute prediction results. Sorted in ascending order, these results form a vector PRE = {pre_1, pre_2, pre_3, …, pre_α} where pre_i < pre_j when i < j. We then set a percentage β to determine p = pre_{α·β}.
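Under one natural reading of this scheme, p is the β-th quantile of the sorted batch results (a sketch; the exact index convention is our assumption):

```python
import numpy as np

def select_p(batch_scores, beta=0.5):
    # batch_scores: alpha per-batch teacher prediction results (e.g. accuracies).
    pre = np.sort(np.asarray(batch_scores, dtype=float))  # ascending vector PRE
    idx = int(np.ceil(len(pre) * beta)) - 1               # position alpha * beta
    return float(pre[max(idx, 0)])

def learns_from_teacher(teacher_score, p):
    # The student imitates the teacher only when the teacher's prediction
    # quality reaches the threshold p; otherwise it uses ground-truth labels.
    return teacher_score >= p
```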

Experiment

• Implementation Details

a) Dataset.

Two classical image classification datasets, CIFAR-10 [30] and CIFAR-100 [30], are selected to validate the effectiveness of MFKD. The details of the two datasets are described in the following.

CIFAR-10 has a total of 60K color images: a training set of 50K images and a test set of 10K images. Each image is 32 × 32 pixels. There are 10 categories in total, each with 6K images.

Similar to CIFAR-10, CIFAR-100 also has 60K color images, including 50K training images and 10K test images. It has 100 categories, each with 600 images (500 for training and 100 for testing). The image size is 32 × 32 pixels.

b) Models.

Three kinds of neural networks are used in our experiments: ResNet [4], which is narrow and deep; WideResNet [31], which is wider but shallower than ResNet; and VGG [2], a classical linear-structure network. In addition, our experiments focus on knowledge distillation between networks with the same architectural style, e.g., VGG13 as the teacher and VGG8 as the student.

c) Setting.

Data augmentation. For the CIFAR training sets, we first pad the image with 4 pixels on each side, then randomly crop it to 32×32 pixels, apply a random horizontal flip with probability 0.5, and finally normalize it with the per-channel mean and standard deviation. For the test sets, only normalization is applied.

Training parameter settings (Table 1). To verify the effectiveness of MFKD, we use the same parameter settings for baseline training and distillation training. The Stochastic Gradient Descent (SGD) algorithm is applied for network optimization, with momentum 0.9 and weight decay 5e-4. The initial learning rate is 0.05 and is multiplied by 0.1 at the 150th, 180th, and 210th epochs. Training runs for 240 epochs in total with a batch size of 64.
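The learning-rate schedule can be written as a small helper (milestones and decay factor as in the settings above; the function name is our own):

```python
def learning_rate(epoch, base_lr=0.05):
    # Step schedule: multiply the rate by 0.1 at epochs 150, 180, and 210.
    lr = base_lr
    for milestone in (150, 180, 210):
        if epoch >= milestone:
            lr *= 0.1
    return lr
```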

• Results

a) CIFAR-10.

On the CIFAR-10 dataset, we conduct two groups of experiments: one with ResNet56 as the teacher and ResNet20 as the student, the other with WRN_40_2 as the teacher and WRN_16_2 as the student.

In the first case, both ResNet56 and ResNet20 contain three convolutional layer groups; we extract features at the output position of each layer group and take the average of three runs as the final result. Compared with conventional training, MFKD improves the Top-1 accuracy of ResNet20 by 0.48%, which is also slightly better than several other methods. The experimental results are shown in Table 2.

Table 2. Top-1 accuracy of student network with ResNet20 on CIFAR-10 test dataset.

https://doi.org/10.1371/journal.pone.0285901.t002

In the second case, WRN_40_2 and WRN_16_2 both have three convolutional layer groups, and we extract the output feature maps of each layer group for fusion. Taking the average of three experiments as the final result, our method improves the Top-1 accuracy of WRN_16_2 by 0.59% compared with conventional training, which is also slightly better than the other methods. The experimental results are tabulated in Table 3.

Table 3. Top-1 accuracy of student network WRN_16_2 on CIFAR-10 test dataset.

https://doi.org/10.1371/journal.pone.0285901.t003

b) CIFAR-100.

For the CIFAR-100 dataset, we consider two groups of cases: the first with VGG13 as the teacher and VGG8 as the student; the second with ResNet56 as the teacher and ResNet20 as the student.

In case 1, both VGG13 and VGG8 contain five convolutional layer groups, and we select the output positions of the 2nd, 3rd, and 4th layer groups to extract features for validating MFKD. The average of three experiments is taken as the final result. MFKD improves the Top-1 accuracy of VGG8 by 3.35% compared with conventional training. The experimental results are displayed in Table 4.

Table 4. Top-1 accuracy of student network with VGG8 on CIFAR-100 test set.

https://doi.org/10.1371/journal.pone.0285901.t004

Following the same scheme as the first CIFAR-10 case, MFKD is carried out on the CIFAR-100 dataset. The results show that MFKD improves the Top-1 accuracy of ResNet20 by 1.82% compared with conventional training. The experimental results are given in Table 5.

Table 5. Top-1 accuracy of student network with ResNet20 on CIFAR-100 dataset.

https://doi.org/10.1371/journal.pone.0285901.t005

• Ablation Study

The location of feature extraction and the fusion method are two important factors affecting MFKD. Here, we conduct detailed ablation experiments on both.

a) Extract Location.

Since the VGG network has five different layer groups, it is convenient for exploring how the extraction position affects MFKD; the ablation experiments are therefore conducted with VGG networks.

Compared with the original extraction combination, which uses the output features of the 2nd, 3rd, and 4th layer groups, we design two further combinations, Combination A and Combination B. Combination A replaces the output features of the 2nd layer group with those of the 5th layer group, i.e., it replaces the front-layer features with bottom-layer features. Combination B replaces the output features of the 3rd layer group with those of the 5th layer group, i.e., it replaces the middle-layer features with bottom-layer features. Taking VGG13 as the teacher and VGG8 as the student, the experimental results on CIFAR-100 are shown in Table 6.

Table 6. The performance of the student network VGG8 on the CIFAR-100 dataset by using different extraction combinations.

https://doi.org/10.1371/journal.pone.0285901.t006

It can be seen that the original extraction combination covers the front, middle, and bottom parts of the network, and the performance of MFKD decreases when either the front-layer or the middle-layer features are missing.

b) Fusion Method.

We explore two ways of fusing feature maps: ADD and CONCAT. ADD adds feature map A and feature map B element-wise, while CONCAT concatenates them along the channel dimension into a new feature map.

When ADD is used for fusion, the value of the fused feature at each pixel is the average of the values of the two feature maps at that pixel. When CONCAT is used, a structure of one 1×1 convolution followed by two 3×3 convolutions processes the concatenated feature maps to adjust the channel dimension and aggregate the features.
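The two fusion operators can be sketched in numpy (a simplified illustration: the 1×1 projection stands in for the paper's 1×1 conv, and the follow-up 3×3 convs are omitted):

```python
import numpy as np

def fuse_add(a, b):
    # ADD: element-wise average of two same-shaped feature maps.
    return (a + b) / 2.0

def fuse_concat(a, b, w):
    # CONCAT: stack the maps along the channel axis, then restore the
    # target channel count with a 1x1 convolution (w: (C_out, C_a + C_b)).
    cat = np.concatenate([a, b], axis=0)
    return np.einsum('oc,chw->ohw', w, cat)
```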

It follows from Tables 7 and 8 that CONCAT performs better than ADD for the residual network ResNet, while ADD is better than CONCAT for the linear network VGG. The choice of fusion method varies with the network structure.

Table 7. The Top-1 accuracy of the student network ResNet20 using different fusion methods where the teacher is ResNet56 on CIFAR-100.

https://doi.org/10.1371/journal.pone.0285901.t007

Table 8. The Top-1 accuracy of the student network VGG8 using different fusion methods where the teacher is VGG13 on CIFAR-100.

https://doi.org/10.1371/journal.pone.0285901.t008

• Extension

In addition, we apply our method between teacher and student networks with different structural styles: ResNet50, with four convolutional layer groups, serves as the teacher, and VGG8, with five convolutional layer groups, serves as the student. On CIFAR-100, we extract the output feature maps of the 2nd, 3rd, and 4th layer groups of ResNet50 and those of the 3rd, 4th, and 5th layer groups of VGG8, respectively.

The fusion features of ResNet50 are processed with 1×1 convolutions to match the number of channels of VGG8. MFKD improves the Top-1 accuracy of VGG8 by 2.18% compared with conventional training. The experimental results are given in Table 9.

Table 9. Top-1 accuracy of student network VGG8 on CIFAR-100 dataset.

https://doi.org/10.1371/journal.pone.0285901.t009

Conclusions and future work

In the present study, Multi-feature Fusion Knowledge Distillation (MFKD) is proposed to improve the performance of the student network. Specifically, we first design a feature fusion pyramid to effectively fuse features from multiple layers. Then, the quality of the feature maps is refined by an attention mechanism. Finally, the hyperparameter p lets the student choose what to learn from, further improving the distillation effect. Experiments show that MFKD can significantly outperform state-of-the-art methods.

In the future, we may explore MFKD more comprehensively in cases where the teacher and student networks have different structural styles. Further, applying MFKD to object detection, image segmentation, and other tasks is another research interest.

References

  1. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 2012;25.
  2. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014. Available from: https://arxiv.org/abs/1409.1556v1#.
  3. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015.
  4. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016.
  5. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015.
  6. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems. 2015;28.
  7. Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. 2015. Available from: https://arxiv.org/abs/1503.02531.
  8. He R, Sun S, Yang J, Bai S, Qi X. Knowledge distillation as efficient pre-training: Faster convergence, higher data-efficiency, and better transferability. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022.
  9. Li H, Kadav A, Durdanovic I, Samet H, Graf HP. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. 2016. Available from: https://arxiv.org/abs/1608.08710v2.
  10. Li Y, Adamczewski K, Li W, Gu S, Timofte R, Van Gool L. Revisiting random channel pruning for neural network compression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022.
  11. Hong C, Kim H, Baik S, Oh J, Lee KM. DAQ: Channel-wise distribution-aware quantization for deep image super-resolution networks. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2022.
  12. Han T, Li D, Liu J, Tian L, Shan Y. Improving low-precision network quantization via bin regularization. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021.
  13. Luo Y, Zhao X-L, Meng D, Jiang T-X. HLRTF: Hierarchical low-rank tensor factorization for inverse problems in multi-dimensional imaging. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022.
  14. Zhao B, Cui Q, Song R, Qiu Y, Liang J. Decoupled knowledge distillation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022.
  15. Romero A, Ballas N, Kahou SE, Chassang A, Gatta C, Bengio Y. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550. 2014. Available from: https://arxiv.org/abs/1412.6550.
  16. Chen P, Liu S, Zhao H, Jia J. Distilling knowledge via knowledge review. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021.
  17. Tian Y, Krishnan D, Isola P. Contrastive representation distillation. arXiv preprint arXiv:1910.10699. 2019. Available from: https://arxiv.org/abs/1910.10699v1.
  18. Yao L, Pi R, Xu H, Zhang W, Li Z, Zhang T. G-DetKD: Towards general distillation framework for object detectors via contrastive and semantic-guided feature imitation. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021.
  19. Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017.
  20. Lin M, Ji R, Wang Y, Zhang Y, Zhang B, Tian Y, et al. HRank: Filter pruning using high-rank feature map. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020.
  21. Wang Z, Li C, Wang X. Convolutional neural network pruning with structural redundancy reduction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021.
  22. Kim D, Lee J, Ham B. Distance-aware quantization. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021.
  23. Lin S, Ji R, Chen C, Tao D, Luo J. Holistic CNN compression via low-rank decomposition with knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2018;41(12):2889–905. pmid:30281439
  24. Kim J, Park S, Kwak N. Paraphrasing complex network: Network compression via factor transfer. Advances in Neural Information Processing Systems. 2018;31.
  25. Heo B, Kim J, Yun S, Park H, Kwak N, Choi JY. A comprehensive overhaul of feature distillation. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019.
  26. Zagoruyko S, Komodakis N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. 2016. Available from: https://arxiv.org/abs/1612.03928v2.
  27. Yim J, Joo D, Bae J, Kim J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017.
  28. Srinivas S, Fleuret F. Knowledge transfer with Jacobian matching. International Conference on Machine Learning; 2018: PMLR.
  29. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018.
  30. Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. 2009.
  31. Zagoruyko S, Komodakis N. Wide residual networks. arXiv preprint arXiv:1605.07146. 2016. Available from: https://arxiv.org/abs/1605.07146v1.
  32. Huang Z, Wang N. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219. 2017. Available from: https://arxiv.org/abs/1707.01219.
  33. Ahn S, Hu SX, Damianou A, Lawrence ND, Dai Z. Variational information distillation for knowledge transfer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019.
  34. Peng B, Jin X, Liu J, Li D, Wu Y, Liu Y, et al. Correlation congruence for knowledge distillation. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019.
  35. Park W, Kim D, Lu Y, Cho M. Relational knowledge distillation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019.
  36. Tung F, Mori G. Similarity-preserving knowledge distillation. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019.
  37. Passalis N, Tefas A. Learning deep representations with probabilistic knowledge transfer. Proceedings of the European Conference on Computer Vision (ECCV); 2018.
  38. Heo B, Lee M, Yun S, Choi JY. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. Proceedings of the AAAI Conference on Artificial Intelligence; 2019.