Fig 1.

Carbon footprints of different deep models while (a) training on CIFAR 100 (in log scale) and (b-c) inferring on the evaluation set. ResNet18 is a deeper model with 11.2M parameters, resulting in a higher inference time (4.7 s) and CO2 emission (0.087 g). To reduce these costs, using ResNet18 as a teacher, we train two student models, MobileNetV2 (student 1) and ShuffleNetV2 (student 2), following the traditional KD process. This training incurs a significant carbon footprint (red and green dashed curves in (a)), with an accuracy gain from learning from the teacher model (black dotted curve in (a)). However, as expected, both students consume less time and emit less CO2 during inference (red and green shaded bars in (b) and (c)). We aim to reduce the training cost and CO2 production of the KD process while using the same students (red and green solid curves in (a)) and maintaining accuracy and inference costs (solid red and green bars in (b) and (c)) similar to those of the costly KD training.

Fig 2.

Block diagrams of KD architectures while training the teacher and student models.

(a) Given input X, the trainable teacher model (shown in green) learns to produce logits hm after a softmax activation. Cross-entropy loss is used to train the teacher model. (b) The trainable student model (shown in green) learns from a frozen teacher (shown in red). The teacher and student produce unactivated logits am and as, respectively. as are activated using a softmax containing the hyperparameter τ, producing soft logits os. Similarly, om is produced for the teacher. The soft logits are used to calculate the KLD loss. The hard logits from the student, hs, are further used to calculate a cross-entropy loss, which acts as additional supervision for the student to learn better.
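The loss described in panel (b) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the mixing weight `alpha`, and the default `tau` are assumptions introduced here, and the caption does not state how the two loss terms are weighted. In the sketch, `o_m` is the teacher's soft output and `o_s` the student's.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; a larger tau gives a flatter distribution."""
    scaled = [a / tau for a in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])

def kd_loss(teacher_logits, student_logits, label, tau=4.0, alpha=0.5):
    """Weighted sum of the KLD term (soft outputs) and the CE term (hard outputs).

    alpha and the default tau are hypothetical values chosen for illustration.
    """
    o_m = softmax(teacher_logits, tau)   # teacher soft output (teacher is frozen)
    o_s = softmax(student_logits, tau)   # student soft output
    h_s = softmax(student_logits)        # student hard output (tau = 1)
    kld = kl_divergence(o_m, o_s)
    ce = cross_entropy(h_s, label)
    return alpha * kld + (1.0 - alpha) * ce

if __name__ == "__main__":
    # Toy three-class example; the logits are made up.
    print(kd_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3], label=0))
```

When the student's logits match the teacher's, the KLD term vanishes and only the cross-entropy supervision remains. Many KD implementations also multiply the KLD term by τ²; that scaling is omitted here because the caption does not mention it.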

Fig 3.

(a) Impact of using different τ for the ResNet18 (teacher) and MobileNetV2 (student) models on the Tiny ImageNet dataset. We notice a significant performance variance across different τ values. (b) Impact of using different batch sizes for our proposed stochastic solution. Similar performance across different batch sizes shows that our proposal does not depend on the training batch size.

Table 1.

Carbon footprints of the teacher (ResNet18) and student (MobileNetV2) models before KD (top two rows), with traditional KD [26], and with our approach on three different datasets.

↑ (↓) means higher (lower) is better.

Table 2.

Summary of datasets used in the study.

Table 3.

Information about the different model architectures used in this study.

Table 4.

Carbon footprints and performance for different KD approaches.

For image recognition, we use ResNet18 (teacher) and MobileNetV2 (student) models. For object detection, we use VGG16 (teacher) and MobileNetV2 (student). We report results averaged over five runs of the same program. ↑ (↓) means higher (lower) is better.

Table 5.

Different performance metrics of the ResNet18-MobileNetV2 teacher-student combination on the CIFAR 10 dataset.

In addition to maintaining a low carbon footprint, our method performs similarly to the KD [26] method.

Fig 4.

Visual illustration of the outputs produced by the No compression, KD [26] and Ours methods.

(a) Output logits for a sample ‘bird’ image from CIFAR 10. (b) Grad-CAM visualization of four sample images from CIFAR 10 using MobileNetV2. ResNet18 is used as the teacher. (c) Confusion matrices of the predictions.

Table 6.

Accuracy and carbon footprints obtained with different data-free knowledge distillation (DFKD) techniques.

↑ (↓) means higher (lower) is better.

Table 7.

Performance of different student-teacher combinations on the Tiny ImageNet dataset.

Our stochastic method consistently achieves a lower carbon footprint than the tuned KD approach while keeping similar accuracy. ↑ (↓) means higher (lower) is better.

Table 8.

Comparison of quantization and KD methods for model compression.

Experiments are done with the MobileNetV2 architecture on the CIFAR 10 dataset. KD techniques use ResNet18 as the teacher. Our method achieves the best performance in both accuracy and carbon footprint metrics.
