Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models

Mengyu Zheng*, Yehui Tang, Zhiwei Hao, Kai Han, Yunhe Wang, Chang Xu*

Abstract


"Multi-modal models such as CLIP possess remarkable zero-shot transfer capabilities, making them highly effective in continual learning tasks. However, this advantage is severely compromised by catastrophic forgetting, which undermines the valuable zero-shot learning abilities of these models. Existing methods predominantly focus on preserving zero-shot capabilities but often fall short in fully exploiting the rich modal information inherent in multi-modal models. In this paper, we propose a strategy to enhance both the zero-shot transfer ability and adaptability to new data distribution. We introduce a novel graph-based multi-modal proximity distillation approach that preserves the intra- and inter-modal information for visual and textual modalities. This approach is further enhanced with a sample re-weighting mechanism, dynamically adjusting the influence of teachers for each individual sample. Experimental results demonstrate a considerable improvement over existing methodologies, which illustrate the effectiveness of the proposed method in the field of continual learning. Code is available at github.com/myz-ah/AwoForget."

Related Material


[pdf] [supplementary material] [DOI]