IVTP: Instruction-guided Visual Token Pruning for Large Vision-Language Models

Kai Huang*, Hao Zou, Ye Xi, Bochen Wang, Zhen Xie, Liang Yu

Abstract


"Inspired by the remarkable achievements of Large Language Models (LLMs), Large Vision-Language Models (LVLMs) have likewise experienced significant advancements. However, the increased computational cost and token budget occupancy associated with lengthy visual tokens pose significant challenge to the practical applications. Considering that not all visual tokens are essential to the final response, selectively pruning redundant visual tokens can effectively alleviate this challenge. In this paper, we present a novel Instruction-guided Visual Token Pruning (IVTP) approach for LVLMs, which is designed to strike a better balance between computational efficiency and the performance. Specifically, a Group-wise Token Pruning (GTP) module based on attention rollout is integrated into the grouped transformer layer to achieve intra-group attention aggregation via residual connection, thereby improving the assessment of visual token importance, especially for LVLMs with a frozen visual encoder. We then extend the module to LLM in order to further filter out visual tokens that are pertinent to the current textual instructions, by introducing a semantically related pseudo CLS token to serve as a reference for token pruning. This two-stage token pruning mechanism permits a systematic and efficient reduction in the quantity of visual tokens while preserving essential visual information. We apply the proposed method to the most representative LVLM, i.e. LLaVA-1.5. Experimental results demonstrate that when the number of visual tokens is reduced by 88.9%, the computational complexity is decreased by over 46%, with only an average 1.0% accuracy drop across 12 benchmarks, and remarkably surpasses the state-of-the-art token pruning methods."
