SignGen: End-to-End Sign Language Video Generation with Latent Diffusion
Fan Qi*, Yu Duan, Changsheng Xu, Huaiwen Zhang*
Abstract
The seamless transformation of textual input into natural and expressive sign language holds profound societal significance. Sign language is not solely about hand gestures; it also encompasses facial expressions and mouth movements essential for nuanced communication. Achieving both semantic precision and emotional resonance in text-to-sign-language translation is therefore of paramount importance. Our work pioneers direct end-to-end translation of text into sign language videos, with a realistic representation of the entire body and facial expressions. We go beyond traditional diffusion models by tailoring multi-modal conditions for sign language videos. Additionally, our motion-aware sign generation framework improves alignment between text and visual cues in sign language, further raising the quality of the generated videos. Extensive experiments show that our approach significantly outperforms state-of-the-art methods in semantic consistency, naturalness, and expressiveness, establishing benchmark quantitative results on the RWTH-2014, RWTH-2014-T, WLASL, CSL-Daily, and AUTSL datasets. Our code is available at https://github.com/mingtiannihao/SignGen.