Fig 1.
The LKD training procedure follows a bi-level optimization scheme, with an inner loop that trains the student and an outer loop that updates the loss network. In the inner loop, the student model is trained with the current LKD loss under the guidance of the pre-trained teacher model, and the student parameters are recorded at selected iterations. In the outer loop, the recorded student models are evaluated on a validation set with the cross-entropy (CE) loss, and the LKD network parameters are updated from the resulting validation gradients.
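The two loops in this caption can be illustrated with a deliberately small sketch. It is not the authors' implementation, and every name in it is hypothetical. A scalar student `w` and a single scalar `phi` stand in for the student network and the LKD network parameters, and squared error stands in for both the learned loss and the CE validation loss. The student is assumed to restart from zero in every inner loop, and a central finite difference replaces backpropagation of the validation gradient through the recorded iterations.

```python
# Toy bi-level loop in pure Python. Everything here is an illustrative
# assumption: a scalar student w, a scalar loss parameter phi, and squared
# error standing in for both the LKD loss and the CE validation loss.

def learned_loss_grad(w, phi, xs, teacher, labels):
    """d/dw of mean[phi*(w*x - t)^2 + (1 - phi)*(w*x - y)^2]."""
    total = 0.0
    for x, t, y in zip(xs, teacher, labels):
        pred = w * x
        total += 2.0 * x * (phi * (pred - t) + (1.0 - phi) * (pred - y))
    return total / len(xs)


def inner_loop(phi, xs, teacher, labels, steps=20, lr=0.1, record_every=5):
    """Train the student under the current loss and record checkpoints."""
    w = 0.0  # assumption: the student restarts from zero each time
    checkpoints = []
    for k in range(1, steps + 1):
        w -= lr * learned_loss_grad(w, phi, xs, teacher, labels)
        if k % record_every == 0:
            checkpoints.append(w)  # "record some of the iterations"
    return checkpoints


def validation_loss(checkpoints, xs_val, ys_val):
    """Average validation error over all recorded student checkpoints."""
    total = 0.0
    for w in checkpoints:
        total += sum((w * x - y) ** 2 for x, y in zip(xs_val, ys_val)) / len(xs_val)
    return total / len(checkpoints)


def outer_step(phi, train, val, outer_lr=0.5, eps=1e-4):
    """Update phi from a finite-difference estimate of the validation gradient."""
    up = validation_loss(inner_loop(phi + eps, *train), *val)
    down = validation_loss(inner_loop(phi - eps, *train), *val)
    grad = (up - down) / (2.0 * eps)
    return min(1.0, max(0.0, phi - outer_lr * grad))  # keep phi in [0, 1]


def train_lkd_toy(outer_steps=30, phi=0.2):
    xs = [0.5, 1.0, 1.5, 2.0]
    # Teacher slope 2.1, biased training-label slope 2.4, true slope 2.0.
    train = (xs, [2.1 * x for x in xs], [2.4 * x for x in xs])
    xs_val = [0.8, 1.2]
    val = (xs_val, [2.0 * x for x in xs_val])
    history = []
    for _ in range(outer_steps):
        history.append(validation_loss(inner_loop(phi, *train), *val))
        phi = outer_step(phi, train, val)
    return phi, history
```

Calling `train_lkd_toy()` returns the final `phi` and the validation loss measured before each outer update. In this toy setup the training labels are biased on purpose, so the outer loop raises the weight on the teacher term; that behaviour belongs to the synthetic data and says nothing about the results reported in the paper.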
Fig 2.
Architectures of LKD losses.
Table 1.
Performance on the ImageNet dataset. The models are trained with the standard training strategy, using the pre-trained ResNet-34 and ResNet-50 teacher networks provided by Torchvision [42].
Table 2.
Results on the CIFAR-100 dataset with homogeneous teacher and student architectures. The top and bottom model names denote the teacher and the student, respectively.
Table 3.
Results on the CIFAR-100 dataset with heterogeneous teacher and student architectures.
Fig 3.
Performance of different distillation loss types.
Table 4.
Impact of parameter sampling strategies on KD from ResNet-34 to ResNet-18.
Table 5.
Comparison of Gaussian sampling and fixed-step sampling in LKD validation.
Table 6.
Impact of batch data consistency in LKD training on ImageNet.
Table 7.
Impact of training approaches on LKD loss and student model accuracy.