MultiGen: Zero-shot Image Generation from Multi-modal Prompts

Zhi-Fan Wu*, Lianghua Huang, Wei Wang, Yanheng Wei, Yu Liu

Abstract


"The field of text-to-image generation has witnessed substantial advancements in the preceding years, allowing the generation of high-quality images based solely on text prompts. However, accurately describing objects through text alone is challenging, necessitating the integration of additional modalities like coordinates and images for more precise image generation. Existing methods often require fine-tuning or only support using single object as the constraint, leaving the zero-shot image generation from multi-object multi-modal prompts as an unresolved challenge. In this paper, we propose MultiGen, a novel method designed to address this problem. Given an image-text pair, we obtain object-level text, coordinates and images, and integrate the information into an “augmented token” for each object. The augmented tokens serve as additional conditions and are trained alongside text prompts in the diffusion model, enabling our model to handle multi-object multi-modal prompts. To manage the absence of modalities during inference, we leverage a coordinate model and a feature model to generate object-level coordinates and features based on text prompts. Consequently, our method can generate images from text prompts alone or from various combinations of multi-modal prompts. Through extensive qualitative and quantitative experiments, we demonstrate that our method not only outperforms existing methods but also enables a wide range of tasks."
