Conceptual Codebook Learning for Vision-Language Models
Yi Zhang*, Ke Yu, Siqi Wu, Zhihai He*
Abstract
"In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs). CoCoLe aims to address the challenge of enhancing the generalization capability of VLMs while adapting them to downstream tasks in a few-shot setting. We recognize that visual concepts like shapes, colors, and textures are inherently transferable across different domains and are essential for generalization tasks. Motivated by this critical finding, we learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values, which serves as a link between the image encoder’s outputs and the text encoder’s inputs. Specifically, for a given image, we leverage the codebook to identify the most relevant conceptual prompts associated with the class embeddings to perform the classification. Additionally, we incorporate a handcrafted concept cache as a regularization to alleviate the overfitting issues in low-shot scenarios. This conceptual codebook learning method has been shown to improve the alignment between visual and linguistic modalities. Extensive experimental results demonstrate that our CoCoLe method remarkably outperforms the existing state-of-the-art methods across various evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization tasks. Detailed ablation studies further confirm the efficacy of each component in CoCoLe."