Dolfin: Diffusion Layout Transformers without Autoencoder
Yilin Wang, Zeyuan Chen, Liangjun Zhong, Zheng Ding, Zhuowen Tu*
Abstract
In this paper, we introduce a new generative model, Diffusion Layout Transformers without Autoencoder (Dolfin), that attains significantly improved modeling capability and transparency over the existing approaches. Dolfin employs a Transformer-based diffusion process to model layout generation. In addition to an efficient bi-directional (non-causal joint) sequence representation, we also design an autoregressive diffusion model (Dolfin-AR) that is especially adept at capturing neighboring objects’ rich local semantic correlations, such as alignment, size, and overlap. When evaluated on standard unconditional layout generation benchmarks, Dolfin notably outperforms previous methods across various metrics, such as FID, alignment, overlap, MaxIoU, and DocSim scores. Moreover, Dolfin’s applications extend beyond layout generation, making it suitable for modeling other types of geometric structures, such as line segments. Our experiments present both qualitative and quantitative results to demonstrate the advantages of Dolfin.
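To make the abstract's setup concrete, the sketch below shows what "a diffusion process over a layout" can look like in the standard DDPM formulation: a layout is a sequence of bounding boxes [x, y, w, h], and the forward process progressively noises the clean boxes toward Gaussian noise (the model would then be trained to reverse this). This is a generic illustrative sketch, not the authors' code; the schedule parameters and function names (`make_schedule`, `q_sample`) are assumptions for illustration only.

```python
import numpy as np

def make_schedule(T=100, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; returns cumulative alpha products."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alphas_cum, rng):
    """Noise a clean layout x0 to step t: x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    a = alphas_cum[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
# A toy two-element layout: each row is one box [x, y, w, h] in [0, 1].
layout = np.array([[0.1, 0.1, 0.3, 0.2],   # e.g. a title box
                   [0.1, 0.4, 0.8, 0.5]])  # e.g. a body box
alphas_cum = make_schedule()
x_t, eps = q_sample(layout, t=50, alphas_cum=alphas_cum, rng=rng)
print(x_t.shape)  # (2, 4): same layout shape, now partially noised
```

A denoising Transformer (bi-directional as in Dolfin, or causal as in Dolfin-AR) would take `x_t` and `t` as input and predict the noise `eps`, operating directly on the box coordinates rather than on an autoencoder latent.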