Markov Knowledge Distillation: Make Nasty Teachers trained by Self-undermining Knowledge Distillation Fully Distillable
En-hui Yang, Linfeng Ye*
Abstract
"To protect intellectual property of a deep neural network (DNN), two knowledge distillation (KD) related concepts are proposed: distillable DNN and KD-resistant DNN. A DNN is said to be distillable if used as a black-box input-output teacher, it can be distilled by a KD method to train a student model so that the distilled student outperforms the student trained alone with label smoothing (LS student) in terms of accuracy. A DNN is said to be KD-resistant with respect to a specific KD method if used as a black-box input-output teacher, it cannot be distilled by that specific KD method to yield a distilled student outperforming LS student in terms of accuracy. A new KD method called Markov KD (MKD) is further presented. When applied to nasty teachers trained by self-undermining KD, MKD makes those nasty teachers fully distillable, although those nasty teachers are shown to be KD-resistant with respect to state-of-the-art KD methods existing in the literature before our work. When applied to normal teachers, MKD yields distilled students outperforming those trained by KD from the same normal teachers by a large margin. More interestingly, MKD is capable of transferring knowledge from teachers trained in one domain to students trained in another domain."