SiRi: A Simple Selective Retraining Mechanism for Transformer-Based Visual Grounding

Mengxue Qu, Yu Wu, Wu Liu, Qiqi Gong, Xiaodan Liang, Olga Russakovsky, Yao Zhao, Yunchao Wei

Abstract


"In this paper, we investigate how to achieve better referring visual grounding with modern vision-language transformers, and propose a simple yet powerful Selective Retraining (SiRi) mechanism. In particular, SiRi conveys a significant principle for visual grounding research, i.e., a better-initialized vision-language encoder helps the model converge to a better local minimum, advancing performance accordingly. Following this principle, we continually update the parameters of the encoder as training goes on, while periodically re-initializing the remaining parameters to compel the model to be better optimized on top of the enhanced encoder. With this simple training mechanism, SiRi significantly outperforms previous approaches on three popular benchmarks. Additionally, we show that SiRi performs surprisingly well even with limited training data. More importantly, the effectiveness of SiRi is further verified on other models and other vision-language (V-L) tasks. Code is available in the supplementary materials."
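The retraining loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the parameter dictionaries, the `init_rest`/`train_epoch` helpers, and the retraining period are all hypothetical placeholders for the real encoder/decoder modules and optimizer steps.

```python
import random

def init_rest(seed=None):
    # Hypothetical fresh initialization of the non-encoder ("rest") parameters.
    rng = random.Random(seed)
    return {"decoder_w": rng.uniform(-0.1, 0.1)}

def train_epoch(params):
    # Placeholder for one epoch of gradient-based training.
    return {k: v + 0.01 for k, v in params.items()}

def siri_train(total_epochs=9, retrain_period=3):
    """Sketch of SiRi-style selective retraining (names are illustrative).

    Encoder parameters are carried across the whole run, while the remaining
    parameters are periodically re-initialized so the model is re-optimized
    on top of the progressively enhanced encoder.
    """
    encoder = {"encoder_w": 0.0}     # continually updated, never reset
    rest = init_rest(seed=0)         # periodically re-initialized
    for epoch in range(1, total_epochs + 1):
        encoder = train_epoch(encoder)
        rest = train_epoch(rest)
        if epoch % retrain_period == 0 and epoch < total_epochs:
            rest = init_rest(seed=epoch)  # selective retraining step
    return encoder, rest
```

The key design point is the asymmetry: only the `rest` parameters are reset, so the accumulated encoder knowledge is preserved while the rest of the model escapes the local minimum it had settled into.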

Related Material


[pdf] [supplementary material] [DOI]