ECVA | European Computer Vision Association

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

Jingye Chen*, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei ;

Abstract

"The diffusion model has been proven a powerful generative model in recent years, yet it remains a challenge in generating visual text. Although existing work has endeavored to enhance the accuracy of text rendering, these methods still suffer from several drawbacks, such as (1) limited flexibility and automation, (2) constrained capability of layout prediction, and (3) restricted diversity. In this paper, we present TextDiffuser-2, aiming to unleash the power of language models for text rendering while taking these three aspects into account. Firstly, we fine-tune a large language model for layout planning. The large language model is capable of automatically generating keywords and placing the text in optimal positions for text rendering. Secondly, we utilize the language model within the diffusion model to encode the position and content of keywords at the line level. Unlike previous methods that employed tight character-level guidance, our approach generates more diverse text images. We conduct extensive experiments and incorporate user studies involving human participants and GPT-4V, validating TextDiffuser-2’s capacity to achieve a more rational text layout and generation with enhanced diversity. Furthermore, the proposed methods are compatible with existing text rendering techniques, such as TextDiffuser and GlyphControl, serving to enhance automation and diversity, as well as augment the rendering accuracy. For instance, by using the proposed layout planner, TextDiffuser is capable of rendering text with more aesthetically pleasing line breaks and alignment, meanwhile obviating the need for explicit keyword specification. Furthermore, GlyphControl can leverage the layout planner to achieve diverse layouts without the necessity for user-specified glyph images, and the rendering F-measure can be boosted by 6.51% when using the proposed layout encoding training technique. The code and model are available at https: //aka.ms/textdiffuser-2."

Related Material

[pdf] [supplementary material] [DOI]