Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Shuangrui Ding*, Rui Qian, Haohang Xu, Dahua Lin, Hongkai Xiong

Abstract


In this paper, we propose a simple yet effective approach to self-supervised video object segmentation (VOS). Previous self-supervised VOS techniques mostly rely on auxiliary modalities or iterative slot attention to assist object discovery, which restricts their general applicability. To address these challenges, we develop a simplified architecture that capitalizes on the emerging objectness in DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Our key insight is that the inherent structural dependencies in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Moreover, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Specifically, we first introduce a single spatio-temporal Transformer block that processes frame-wise DINO features and establishes spatio-temporal dependencies in the form of self-attention. Then, using these attention maps, we apply hierarchical clustering to generate object segmentation masks. To train the spatio-temporal block in a fully self-supervised manner, we employ semantic and dynamic motion consistency coupled with entropy normalization. Our method achieves state-of-the-art performance on three multi-object video segmentation benchmarks. In particular, it improves FG-ARI by over 5 points on the complex real-world DAVIS-17-Unsupervised and YouTube-VIS-19 benchmarks compared to the previous best result. The code and checkpoints are released at https://github.com/shvdiwnkozbw/SSL-UVOS.
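The clustering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy attention tensor, the grid shape, and the fixed cluster count of 3 are assumptions made for the example. It shows the core idea of treating each token's self-attention distribution as a feature vector and grouping tokens by hierarchical (agglomerative) clustering to form mask proposals.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
T, H, W = 2, 8, 8   # frames and spatial grid of patch tokens (illustrative)
N = T * H * W       # total spatio-temporal tokens

# Stand-in for the row-normalized self-attention matrix produced by the
# spatio-temporal Transformer block: attn[i] is token i's attention
# distribution over all N tokens.
attn = rng.random((N, N))
attn /= attn.sum(axis=1, keepdims=True)

# Hierarchical clustering on the attention rows: tokens with similar
# attention distributions are merged into the same group.
Z = linkage(attn, method="average", metric="cosine")
labels = fcluster(Z, t=3, criterion="maxclust")  # e.g. up to 3 object clusters

# Reshape per-token labels back onto the frame grid as mask proposals.
masks = labels.reshape(T, H, W)
print(masks.shape)
```

In practice the attention maps would come from the trained block rather than random data, and the number of clusters would be chosen per video (e.g. by a merging threshold) rather than fixed.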

Related Material


[pdf] [supplementary material] [DOI]