Markov Knowledge Distillation: Make Nasty Teachers Trained by Self-undermining Knowledge Distillation Fully Distillable

En-hui Yang, Linfeng Ye*

Abstract


"To protect intellectual property of a deep neural network (DNN), two knowledge distillation (KD) related concepts are proposed: distillable DNN and KD-resistant DNN. A DNN is said to be distillable if used as a black-box input-output teacher, it can be distilled by a KD method to train a student model so that the distilled student outperforms the student trained alone with label smoothing (LS student) in terms of accuracy. A DNN is said to be KD-resistant with respect to a specific KD method if used as a black-box input-output teacher, it cannot be distilled by that specific KD method to yield a distilled student outperforming LS student in terms of accuracy. A new KD method called Markov KD (MKD) is further presented. When applied to nasty teachers trained by self-undermining KD, MKD makes those nasty teachers fully distillable, although those nasty teachers are shown to be KD-resistant with respect to state-of-the-art KD methods existing in the literature before our work. When applied to normal teachers, MKD yields distilled students outperforming those trained by KD from the same normal teachers by a large margin. More interestingly, MKD is capable of transferring knowledge from teachers trained in one domain to students trained in another domain."
