SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia*, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang

Abstract


"3D vision-language (3dvl) grounding, which aims to align language with 3D physical environments, stands as a cornerstone in developing embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces two significant challenges: (i) the scarcity of paired 3dvl data to support grounded learning of 3D scenes, especially considering complexities within diverse object configurations, rich attributes, and intricate relationships; and (ii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these major challenges in 3D-VL by examining the potential of systematically upscaling 3D-VL learning in indoor scenes. We introduce the first million-scale 3D-VL dataset, , encompassing indoor scenes and comprising vision-language pairs collected from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (), for 3D-VL learning. Through extensive experiments, we showcase the effectiveness of by achieving performance on existing 3D visual grounding and question-answering benchmarks. We also show that the data scaling effect is not limited to , but is generally beneficial for models on tasks like 3D semantic segmentation. The vast potential of and is unveiled through zero-shot transfer experiments in challenging 3dvl tasks."
