MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models

Nithin Gopalakrishnan Nair*, Jeya Maria Jose Valanarasu, Vishal Patel

Abstract


"Large diffusion-based Text-to-Image (T2I) models have shown impressive generative powers for text-to-image generation and spatially conditioned image generation. We can train the model end-to-end with paired data for most applications to obtain photorealistic generation quality. However, to add a task, one often needs to retrain the model from scratch using paired data across all modalities to retain good generation performance. This paper tackles this issue and proposes a novel strategy to scale a generative model across new tasks with minimal computation. During our experiments, we discovered that the variance maps of intermediate feature maps of diffusion models capture the conditioning intensity. Utilizing this prior information, we propose MaxFusion, an efficient strategy to scale up text-to-image generation models to accommodate new modality conditions. Specifically, we combine aligned features of multiple models, bringing a compositional effect. Our fusion strategy can be integrated into off-the-shelf models to enhance their generative prowess."
