EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
Jiazhou Zhou*, Xu Zheng, Yuanhuiyi Lyu, Lin Wang
Abstract
"In this paper, we propose EventBind, a novel and effective framework that unleashes the potential of vision-language models (VLMs) for event-based recognition to compensate for the lack of large-scale event-based datasets. In particular, due to the distinct modality gap with the image-text data and the lack of large-scale datasets, learning a common representation space for images, texts, and events is non-trivial. Intuitively, we need to address two key challenges: 1) how to generalize CLIP’s visual encoder to event data while fully leveraging events’ unique properties, , sparsity and high temporal resolution; 2) how to effectively align the multi-modal embeddings, , image, text, and events. Accordingly, we first introduce a novel event encoder that subtly models the temporal information from events and meanwhile generates event prompts for modality bridging. We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind’s generalization ability across diverse datasets. With the proposed event encoder, text encoder, and image encoder, a novel Hierarchical Triple Contrastive Alignment HTCA module is introduced to jointly optimize the correlation and enable efficient knowledge transfer among the three modalities. We evaluate various settings, including fine-tuning and few-shot on three benchmarks and our EventBind achieves new state-of-art accuracy compared with the previous methods, such as on N-Caltech101 (+5.34% and +1.70%) and N-Imagenet (+5.65% and +1.99%) with fine-tuning and 20-shot settings respectively. Moreover, our EventBind can be flexibly extended to the event retrieval task using text or image queries, showing plausible performance."
Related Material
[pdf]
[supplementary material]
[DOI]