MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

Shitao Tang*, Jiacheng Chen, Dilin Wang, Chengzhou Tang, Fuyang Zhang, Yuchen Fan, Vikas Chandra, Yasutaka Furukawa, Rakesh Ranjan

Abstract


"This paper presents a neural architecture for 3D object reconstruction that synthesizes dense and high-resolution views of an object given one or a few images without camera poses. achieves superior flexibility and scalability with two surprisingly simple ideas: 1) A “pose-free architecture” where standard self-attention among 2D latent features learns 3D consistency across an arbitrary number of conditional and generation views without explicitly using camera pose information; and 2) A “view dropout strategy” that discards a substantial number of output views during training, which reduces the training-time memory footprint and enables dense and high-resolution view synthesis at test time. We use the Objaverse for training and the Google Scanned Objects for evaluation with standard novel view synthesis and 3D reconstruction metrics, where significantly outperforms the current state of the arts. We also demonstrate a text-to-3D application example by combining with a text-to-image generative model. The project page is at https://mvdiffusion-plusplus.github.io."
