Improving Zero-Shot Generalization for CLIP with Variational Adapter

Ziqian Lu, Fengli Shen, Mushui Liu, Yunlong Yu*, Xi Li

Abstract


"The excellent generalization capability of pre-trained Vision-Language Models (VLMs) makes fine-tuning VLMs for downstream zero-shot tasks a popular choice. Despite achieving promising performance in the professionality of base classes, most existing fine-tuned methods suffer from feature confusion of novel classes, resulting in unsatisfactory transferability. To address this problem, we propose a divide-and-conquer approach called Prompt-based Variational Adapter (PVA) that explicitly reduces the prediction bias by separating base and novel samples. Specifically, we design two variational adapters with learnable textual tokens to align latent representations for each modality in a shared latent space. Once trained, we can separate novel samples from entangled space using the similarity metric of latent features, i.e., converting confusion task into two independent ones (One for base classes and the other for novel classes). Moreover, to improve the transferability for novel classes, we further refine the output features of the learned adapters with the global features via a residual connection. We conduct extensive experiments on Generalized Zero-Shot Learning and Cross-Dataset Transfer Learning to demonstrate the superiority of our approach and establish a new state-of-the-art on four popular benchmarks."
