SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding

Han Xiao, Wenzhao Zheng, Sicheng Zuo, Peng Gao, Jie Zhou, Jiwen Lu*

Abstract


"Vision transformers have demonstrated promising results and become core components in many tasks. Most existing works focus on context feature extraction and incorporate spatial information through additional positional embedding. However, they only consider the local positional information within each image token and cannot effectively model the global spatial relations of the underlying scene. To address this challenge, we propose an efficient vision transformer architecture, SpatialFormer, with explicit spatial understanding for generalizable image representation learning. Specifically, we accompany the image tokens with adaptive spatial tokens to represent the context and spatial information respectively. We initialize the spatial tokens with positional encoding to introduce general spatial priors and augment them with learnable embeddings to model adaptive spatial information. For better generalization, we employ a decoder-only overall architecture and propose a bilateral cross-attention block for efficient interactions between context and spatial tokens. SpatialFormer learns transferable image representations with explicit scene understanding, where the output spatial tokens can further serve as enhanced initial queries for task-specific decoders for better adaptations to downstream tasks. Extensive experiments on image classification, semantic segmentation, and 2D/3D object detection tasks demonstrate the efficiency and transferability of the proposed SpatialFormer architecture. Code is available at https://github.com/Euphoria16/SpatialFormer."
