Efficient Vision Transformers with Partial Attention

Xuan-Thuy Vo*, Duy-Linh Nguyen, Adri Priadana, Kang-Hyun Jo*

Abstract


"As a core of Vision Transformer (ViT), self-attention has high versatility in modeling long-range spatial interactions because every query attends to all spatial locations. Although ViT achieves promising performance in visual tasks, self-attention’s complexity is quadratic with token lengths. This leads to challenging problems when adapting ViT models to downstream tasks that require high input resolutions. Previous arts have tried to solve this problem by introducing sparse attention such as spatial reduction attention, and window attention. One common point of these methods is that all image/window tokens are joined during computing attention weights. In this paper, we find out that there exist high similarities between attention weights and incur computation redundancy. To address this issue, this paper introduces novel attention, called partial attention, that learns spatial interactions more efficiently, by reducing redundant information in attention maps. Each query in our attention only interacts with a small set of relevant tokens. Based on partial attention, we propose an efficient and general vision Transformer, named PartialFormer, that attains good trade-offs between accuracy and computational costs across vision tasks. For example, on ImageNet-1K, PartialFormer-B3 surpasses Swin-T by 1.7% Top-1 accuracy while saving 25% GFLOPs, and Focal-T by 0.8% while saving 30% GFLOPs."

Related Material


[pdf] [supplementary material] [DOI]