BD-KD: Balancing the Divergences for Online Knowledge Distillation

Ibtihel Amara, Nazanin Sepahvand, Brett H. Meyer, Warren J. Gross, James J. Clark
Dec 2022
Knowledge distillation (KD) has gained a lot of attention in the field of model compression for edge devices thanks to its effectiveness in compressing large, powerful networks into smaller, lower-capacity models. Online distillation, in which the teacher and the student learn collaboratively, has also gained much interest due to its ability to improve the performance of the networks involved. The Kullback-Leibler (KL) divergence ensures proper knowledge transfer between teacher and student. However, most online KD techniques suffer under a network capacity gap: when the models are trained cooperatively and simultaneously, the KL distance becomes incapable of properly minimizing the discrepancy between the teacher's and student's distributions. Alongside accuracy, critical edge-device applications need well-calibrated compact networks, and confidence calibration provides a sensible way of obtaining trustworthy predictions. We propose BD-KD: Balancing of Divergences for online Knowledge Distillation. We show that adaptively balancing between the reverse and forward divergences shifts the focus of the training strategy toward the compact student network without limiting the teacher network's learning process. We demonstrate that performing this balancing at the level of the student distillation loss improves both the accuracy and the calibration of the compact student network. We conducted extensive experiments using a variety of network architectures and show improvements on multiple datasets, including CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet. We illustrate the effectiveness of our approach through comprehensive comparisons and ablations against current state-of-the-art online and offline KD techniques.
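To make the core idea concrete, the following is a minimal sketch of a distillation term that mixes the forward KL (teacher to student) and the reverse KL (student to teacher). This is not the paper's exact formulation: the fixed weight `alpha` here is illustrative, whereas BD-KD balances the two divergences adaptively during training, and the names `bd_kd_student_loss`, `alpha`, and `T` are assumptions for this sketch.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def bd_kd_student_loss(student_logits, teacher_logits, alpha=0.5, T=4.0):
    """Illustrative balanced-divergence student loss:
    a convex combination of the forward KL(teacher || student)
    and the reverse KL(student || teacher). In BD-KD the balance
    is adapted during training; here `alpha` is a fixed weight."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    forward = kl(p_teacher, p_student)   # teacher -> student
    reverse = kl(p_student, p_teacher)   # student -> teacher
    return alpha * forward + (1.0 - alpha) * reverse
```

Setting `alpha` closer to 0 emphasizes the reverse (mode-seeking) divergence, which concentrates the student on the teacher's dominant modes; `alpha` closer to 1 recovers the standard forward-KL distillation term.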