Tailored knowledge distillation with automated loss function learning

doi:10.1371/journal.pone.0325599

Fig 1.

Training pipeline of LKD.

The LKD training procedure follows a bi-level optimization scheme with an inner loop for student training and an outer loop for loss network updates. In the inner loop, we train the student model using the current LKD loss, guided by the pre-trained teacher model, and record the student model parameters at some of the iterations. In the outer loop, we evaluate the student model on a validation set using the cross-entropy (CE) loss, then update the LKD network parameters based on the validation gradients.
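
The bi-level scheme in this caption can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the student is a single scalar weight, the learned loss is reduced to one parameter `phi` that weights a distillation term, the recorded parameters are used by averaging the validation CE over them, and the validation gradient is estimated with finite differences instead of backpropagation. The names `inner_loop`, `outer_step`, `TRAIN`, `VAL` and `TEACHER_W` are hypothetical.

```python
import math

# Toy data (hypothetical): inputs x with binary labels y.
TRAIN = [(1.0, 1), (-1.0, 0), (2.0, 1), (-2.0, 0)]
VAL = [(1.5, 1), (-0.5, 0)]
TEACHER_W = 2.0  # stand-in for a pre-trained teacher


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(target, prob, eps=1e-12):
    """Binary cross-entropy between a target in [0, 1] and a probability."""
    return -(target * math.log(prob + eps)
             + (1 - target) * math.log(1 - prob + eps))


def inner_loop(phi, train, teacher_w, steps=20, lr=0.5, record_every=5):
    """Inner loop: train the student with the current learned loss.

    Assumed loss: BCE(y, p_s) + phi * BCE(p_t, p_s), where p_s and p_t are
    the student and teacher probabilities. Returns the student weights
    recorded every `record_every` steps.
    """
    w = 0.0
    snapshots = []
    for step in range(1, steps + 1):
        grad = 0.0
        for x, y in train:
            p_s = sigmoid(w * x)
            p_t = sigmoid(teacher_w * x)
            # d/dw BCE(t, sigmoid(w * x)) = (p_s - t) * x
            grad += ((p_s - y) + phi * (p_s - p_t)) * x
        w -= lr * grad / len(train)
        if step % record_every == 0:
            snapshots.append(w)
    return snapshots


def val_ce(w, val):
    """Mean validation cross-entropy of the student weight w."""
    return sum(bce(y, sigmoid(w * x)) for x, y in val) / len(val)


def outer_objective(phi, train, val, teacher_w):
    """Validation CE averaged over the recorded student snapshots."""
    snaps = inner_loop(phi, train, teacher_w)
    return sum(val_ce(w, val) for w in snaps) / len(snaps)


def outer_step(phi, train, val, teacher_w, outer_lr=0.5, h=1e-4):
    """Outer loop: one gradient step on phi using a central finite
    difference of the validation objective (an assumed simplification)."""
    up = outer_objective(phi + h, train, val, teacher_w)
    down = outer_objective(phi - h, train, val, teacher_w)
    hypergrad = (up - down) / (2 * h)
    return phi - outer_lr * hypergrad


if __name__ == "__main__":
    phi = 0.5
    for _ in range(5):
        phi = outer_step(phi, TRAIN, VAL, TEACHER_W)
    print("learned phi:", phi)
```

In a realistic setting the loss would be a small network applied to student and teacher outputs, and the outer gradient would be obtained by differentiating through the recorded inner steps; the finite-difference estimate here only keeps the sketch dependency-free.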

Fig 2.

Architectures of LKD losses.

Table 1.

Performance on the ImageNet dataset. We train the models following the standard training strategy, with the pre-trained teacher networks ResNet-34 and ResNet-50 provided by Torchvision [42].

Table 2.

Results on the CIFAR-100 dataset with homogeneous teacher and student architectures. The top and bottom model names represent the teacher and student, respectively.
