Distilling Knowledge from Large-Scale Image Models for Object Detection

Gang Li*, Wenhai Wang, Xiang Li, Ziheng Li, Jian Yang, Jifeng Dai, Yu Qiao, Shanshan Zhang*

Abstract


"Large-scale image models have made great progress in recent years, pushing the boundaries of many vision tasks, , object detection. Considering that deploying large models is impractical in many scenes due to expensive computation overhead, this paper presents a new knowledge distillation method, which Distills knowledge from Large-scale Image Models for object detection (dubbed DLIM-Det). To this end, we make the following two efforts: (1) To bridge the gap between the teacher and student, we present a frozen teacher approach. Specifically, to create the teacher model via fine-tuning large models on a specific task, we freeze the pretrained backbone and only optimize the task head. This preserves the generalization capability of large models and gives rise to distinctive characteristics in the teacher. In particular, when equipped with DEtection TRansformers (DETRs), the frozen teacher exhibits sparse query locations, thereby facilitating the distillation process. (2) Considering that large-scale detectors are mainly based on DETRs, we propose a Query Distillation (QD) method specifically tailored for DETRs. The QD performs knowledge distillation by leveraging the spatial positions and pair-wise relations of teacher’s queries as knowledge to guide the learning of object queries of the student model. Extensive experiments are conducted on various large-scale image models with parameter sizes ranging from 200M to 1B. Our DLIM-Det improves the student with Swin-Tiny by 3.1 mAP when the DINO detector with Swin-Large is used as the teacher. Besides, even when the teacher has 30 times more parameters than the student, DLIM-Det still attains a +2.9 distillation gain."

Related Material


[pdf] [supplementary material] [DOI]