Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks

Manyuan Zhang*, Guanglu Song, Xiaoyu Shi, Yu Liu, Hongsheng Li

Abstract


"In this paper, we investigate how to conduct transfer learning to adapt Stable Diffusion to downstream visual dense prediction tasks such as semantic segmentation and depth estimation. We focus on fine-tuning the Stable Diffusion model, which has demonstrated impressive abilities in modeling image details and high-level semantics. Through our experiments, we have three key insights. Firstly, we demonstrate that for dense prediction tasks, the denoiser of Stable Diffusion can serve as a stronger feature encoder compared to visual-language models pre-trained with contrastive training (e.g., CLIP). Secondly, we show that the quality of extracted features is influenced by the diffusion sampling step t, sampling layer, cross-attention map, model generation capacity, and textual input. Features from Stable Diffusion UNet’s upsampling layers and earlier denoising steps lead to more discriminative features for transfer learning to downstream tasks. Thirdly, we find that tuning Stable Diffusion to downstream tasks in a parameter-efficient way is feasible. We first extensively investigate currently popular parameter-efficient tuning methods. Then we search for the best protocol for effective tuning via reinforcement learning and achieve better tuning results with fewer tunable parameters."
