SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
Xiaoyu Liu, Yuxiang Wei, Ming Liu*, Xianhui Lin, Peiran Ren, Xuansong Xie, Wangmeng Zuo
Abstract
"Recent text-to-image generation methods such as ControlNet have achieved remarkable success in controlling image layouts, where the generated images by the default model are constrained to strictly follow the visual conditions (e.g., depth maps). However, in practice, the conditions usually provide only a rough layout, and we argue that the text prompts can more faithfully reflect user intentions. For handling the disagreements between the text prompts and rough visual conditions, we propose a novel text-to-image generation method dubbed SmartControl, which is designed to align well with the text prompts while adaptively keeping useful information from the visual conditions. The key idea of our SmartControl is to relax the constraints on areas that conflict with the text prompts in visual conditions, and two main procedures are required to achieve such a flexible generation. In specific, we extract information from the generative priors of the backbone model (e.g., ControlNet), which effectively represents consistency between the text prompt and visual conditions. Then, a Control Scale Predictor is designed to identify the conflict regions and predict the local control scales. For training the proposed method, a dataset with text prompts and rough visual conditions is constructed. It is worth noting that, even with a limited number (e.g., 1,000∼2,000) of training samples, our SmartControl can generalize well to unseen objects. Extensive experiments are conducted on four typical visual condition types, and our SmartControl can achieve a superior performance against state-of-the-art methods. Source code, pre-trained models, and datasets will be publicly available."