Large-scale Reinforcement Learning for Diffusion Models

Yinan Zhang*, Eric Tzeng, Yilun Du, Dmitry Kislyuk*

Abstract


Text-to-image diffusion models are cutting-edge deep generative models that have demonstrated impressive capabilities in generating high-quality images. However, these models are susceptible to implicit biases originating from web-scale text-image training pairs, potentially leading to inaccuracies in modeling image attributes. This susceptibility can manifest as suboptimal samples, model bias, and images that do not align with human ethics and preferences. In this paper, we propose a scalable algorithm for enhancing diffusion models using Reinforcement Learning (RL) with a diverse range of reward functions, including human preference, compositionality, and social diversity over millions of images. We demonstrate how our approach significantly outperforms existing methods for aligning diffusion models with human preferences. We further illustrate how this substantially improves pretrained Stable Diffusion (SD) models, generating samples that are preferred by humans 80.3% of the time over those from the base SD model, while simultaneously enhancing object composition and diversity of the samples.
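The abstract describes fine-tuning a generative model with RL against a reward function. The paper's actual method operates on the denoising trajectory of Stable Diffusion; as a hedged, much-simplified illustration of the underlying idea, the toy sketch below applies a REINFORCE-style reward-weighted update to a 1-D Gaussian sampler standing in for the diffusion sampler. The `reward` function (preferring samples near 2.0) and all hyperparameters are hypothetical stand-ins for the paper's reward models and training setup, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Policy" parameters of a toy 1-D Gaussian sampler, standing in for a
# diffusion model's trainable weights.
mu, log_sigma = 0.0, 0.0

def reward(x):
    # Hypothetical reward: prefer samples near a target value of 2.0.
    # In the paper this role is played by learned reward models
    # (human preference, compositionality, diversity).
    return -(x - 2.0) ** 2

lr = 0.05
for step in range(500):
    sigma = np.exp(log_sigma)
    x = rng.normal(mu, sigma, size=64)   # sample a batch from the policy
    r = reward(x)
    adv = r - r.mean()                   # mean baseline reduces variance
    # Score-function (REINFORCE) gradients of log N(x; mu, sigma)
    # with respect to mu and log_sigma, weighted by the advantage.
    g_mu = adv * (x - mu) / sigma**2
    g_ls = adv * (((x - mu) ** 2) / sigma**2 - 1.0)
    mu += lr * g_mu.mean()
    log_sigma += lr * g_ls.mean()
```

After training, `mu` drifts toward the reward's optimum at 2.0 while `sigma` shrinks, i.e. the sampler concentrates on high-reward outputs — the same qualitative effect the paper pursues at scale on diffusion trajectories.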
