ECVA | European Computer Vision Association

Improving Knowledge Distillation via Regularizing Feature Direction and Norm

Yuzhu Wang, Lechao Cheng*, Manni Duan, Yongheng Wang, Zunlei Feng, Shu Kong ;

Abstract

"Knowledge distillation (KD) is a particular technique of model compression that exploits a large well-trained teacher neural network to train a small student network . Treating teacher’s feature as knowledge, prevailing methods train student by aligning its features with the teacher’s, e.g., by minimizing the KL-divergence or L2-distance between their (logits) features. While it is natural to assume that better feature alignment helps distill teacher’s knowledge, simply forcing this alignment does not directly contribute to the student’s performance, e.g., classification accuracy. For example, minimizing the L2 distance between the penultimate-layer features (used to compute logits for classification) does not necessarily help learn a better student classifier. We are motivated to regularize student features at the penultimate layer using teacher towards training a better student classifier. Specifically, we present a rather simple method that uses teacher’s class-mean features to align student features w.r.t their direction. Experiments show that this significantly improves KD performance. Moreover, we empirically find that student produces features that have notably smaller norms than teacher’s, motivating us to regularize student to produce large-norm features. Experiments show that doing so also yields better performance. Finally, we present a simple loss as our main technical contribution that regularizes student by simultaneously (1) aligning the direction of its features with the teacher class-mean feature, and (2) encouraging it to produce large-norm features. Experiments on standard benchmarks demonstrate that adopting our technique remarkably improves existing KD methods, achieving the state-of-the-art KD performance through the lens of image classification (on ImageNet and CIFAR100 datasets) and object detection (on the COCO dataset)."

Related Material

[pdf] [supplementary material] [DOI]