Mitigating carbon footprint for knowledge distillation based deep learning model compression

doi:10.1371/journal.pone.0285668

Fig 1.

Illustration of the carbon footprints of different deep models while (a) training on CIFAR 100 (in log scale) and (b-c) inferring on the evaluation set. ResNet18 is a deeper model with 11.2M parameters, resulting in higher inference time (4.7 sec.) and CO₂ emission (0.087 g). To reduce these costs, we use ResNet18 as a teacher and train two student models, MobileNetV2 (student 1) and ShuffleNetV2 (student 2), following the traditional KD process. This training incurs a significant carbon footprint (red and green dashed curves in (a)) in exchange for an accuracy gain from learning from the teacher model (black dotted curve in (a)). As expected, however, both students consume less time and emit less CO₂ during inference (red and green shaded bars in (b) and (c)). We aim to reduce the training cost and CO₂ production of the KD process while using the same students (red and green solid curves in (a)) and to maintain accuracy and inference costs (solid red and green bars in (b) and (c)) similar to those of the costly KD training.

Fig 2.

Block diagrams of KD architectures while training teacher and student models.

(a) Given input X, the trainable teacher model (indicated as green) learns to produce logits h_m after a softmax activation. Cross-entropy loss is used to train the teacher model. (b) The trainable student model (indicated as green) learns from a frozen teacher (indicated as red). The teacher and student produce unactivated logits a_m and a_s, respectively. a_s is activated using a softmax with the temperature hyperparameter τ, producing the soft logits o_s. Similarly, o_m is produced from a_m for the teacher. The soft logits are used to calculate the KLD loss. The hard logits from the student, h_s, are further used to calculate the cross-entropy loss, which acts as additional supervision that helps the student learn better.
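The student objective described in (b) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the weighting factor `alpha`, the default `tau`, the KL direction (teacher relative to student), and all function names are assumptions introduced here, and the τ² scaling used in some KD formulations is omitted.

```python
import math

def softmax(logits, tau=1.0):
    """Softmax with temperature tau; tau > 1 flattens the distribution."""
    scaled = [a / tau for a in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(probs, label):
    """Cross-entropy of predicted probabilities against one true class index."""
    return -math.log(probs[label])

def kd_loss(a_m, a_s, label, tau=4.0, alpha=0.5):
    """Student loss: weighted sum of KLD on soft logits and CE on hard logits.

    a_m, a_s: unactivated teacher and student logits.
    alpha and the default tau are assumed values, not taken from the paper.
    """
    o_m = softmax(a_m, tau)  # teacher soft logits
    o_s = softmax(a_s, tau)  # student soft logits
    h_s = softmax(a_s)       # student hard logits (tau = 1)
    kld = kl_divergence(o_m, o_s)
    ce = cross_entropy(h_s, label)
    return alpha * kld + (1 - alpha) * ce
```

When the student reproduces the teacher's logits exactly, the KLD term vanishes and only the weighted cross-entropy term remains, which is one way to sanity-check such a sketch.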

Fig 3.

(a) Impact of using different τ values for the ResNet18 (teacher) and MobileNetV2 (student) models on the Tiny ImageNet dataset. We notice a significant performance variance across different τ values. (b) The impact of using different batch sizes for our proposed stochastic solution. Similar performance across different batch sizes shows that our proposal does not depend on the training batch size.

Table 1.

Carbon footprints of the teacher (ResNet18) and student (MobileNetV2) models before KD (top two rows), with traditional KD [26], and with our approach on three different datasets.

↑ (↓) means higher (lower) is better.

Table 2.

Summary of datasets used in the study.

Comparison of quantization and KD methods for model compression.

Experiments use the MobileNetV2 architecture on the CIFAR 10 dataset. KD techniques use ResNet18 as the teacher. Our method achieves the best performance in both accuracy and carbon footprint metrics.
