ReMamber: Referring Image Segmentation with Mamba Twister

Yuhuan Yang, Chaofan Ma, Jiangchao Yao, Zhun Zhong*, Ya Zhang, Yanfeng Wang*

Abstract


"Referring Image Segmentation (RIS) leveraging transformers has achieved great success on the interpretation of complex visual-language tasks. However, the quadratic computation cost makes it resource-consuming in capturing long-range visual-language dependencies. Fortunately, Mamba addresses this with efficient linear complexity in processing. However, directly applying Mamba to multi-modal interactions presents challenges, primarily due to inadequate channel interactions for the effective fusion of multi-modal data. In this paper, we propose , a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism. We achieve competitive results on three challenging benchmarks with a simple and efficient architecture. Moreover, we conduct thorough analyses of and discuss other fusion designs using Mamba. These provide valuable perspectives for future research. The code has been released at: https:// github.com/yyh-rain-song/ReMamber."
