Learning Linguistic Association towards Efficient Text-Video Retrieval

Sheng Fang, Shuhui Wang, Junbao Zhuo, Xinzhe Han, Qingming Huang

Abstract


Text-video retrieval has attracted growing attention recently. A dominant approach is to learn a common space for aligning the two modalities. However, videos generally deliver richer content than text, and captions usually miss certain events or details in the video. The information imbalance between the two modalities makes it difficult to align their representations. In this paper, we propose a general framework, LINguistic ASsociation (LINAS), which utilizes the complementarity between captions corresponding to the same video. Concretely, we first train a teacher model that takes extra relevant captions as inputs and aggregates language semantics to obtain more comprehensive text representations. Since the additional captions are inaccessible during inference, Knowledge Distillation is employed to train a student model with a single caption as input. We further propose an Adaptive Distillation strategy, which allows the student model to adaptively learn the knowledge from the teacher model. This strategy also suppresses the spurious relations introduced during the linguistic association. Extensive experiments demonstrate the effectiveness and efficiency of LINAS with various baseline architectures on benchmark datasets.
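
The teacher-student idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the mean pooling in `aggregate_captions`, the mean-squared-error objective in `distillation_loss`, and both function names are assumptions chosen for illustration only. The Adaptive Distillation strategy is not sketched, because the abstract does not specify how it works.

```python
from typing import List

Vector = List[float]


def aggregate_captions(caption_embeddings: List[Vector]) -> Vector:
    """Teacher-side text representation for one video.

    Assumption: captions of the same video are combined by mean pooling.
    The abstract only says the teacher "aggregates language semantics".
    """
    if not caption_embeddings:
        raise ValueError("need at least one caption embedding")
    dim = len(caption_embeddings[0])
    count = len(caption_embeddings)
    return [sum(e[i] for e in caption_embeddings) / count for i in range(dim)]


def distillation_loss(student: Vector, teacher: Vector) -> float:
    """Mean squared error between student and teacher embeddings.

    Assumption: a plain MSE objective stands in for the distillation loss.
    """
    if len(student) != len(teacher):
        raise ValueError("embeddings must have the same dimension")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Toy embeddings for three captions describing the same video (made-up values).
captions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# The teacher sees all captions; the student sees only the first one.
teacher_repr = aggregate_captions(captions)
student_repr = captions[0]
loss = distillation_loss(student_repr, teacher_repr)
```

In this sketch the student is pulled toward a representation that reflects all captions of the video, while needing only a single caption at inference time, which is the property the abstract attributes to the distilled student.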
