Visual Alignment Pre-training for Sign Language Translation

Peiqi Jiao, Yuecong Min, Xilin Chen*

Abstract


"Sign Language Translation (SLT) aims to translate sign videos into text sentences. While gloss sequences, the written approximation of sign videos, provide informative alignment supervision for visual representation learning in SLT, the associated high cost of gloss annotations hampers the scalability. Recent works have yet to achieve satisfactory results without gloss annotations. In this study, we attribute the challenge to the flexible correspondence between visual and textual tokens, and aim to address it by constructing a gloss-like constraint from text sentences. Specifically, we propose a Visual Alignment Pre-training (VAP) scheme to exploit visual information by aligning visual and textual tokens in a greedy manner. The VAP scheme enhances visual encoder in capturing semantic-aware visual information and facilitates better adaptation with translation modules pre-trained on large-scale corpora. Experimental results across four SLT benchmarks demonstrate the effectiveness of VAP, which can generate reasonable alignments and significantly narrow the performance gap with gloss-based methods."
