M3DBench: Towards Omni 3D Assistant with Interleaved Multi-modal Instructions
Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Zhuoyuan Li, Gang Yu, Tao Chen*
Abstract
"Recently, the understanding of the 3D world has garnered increased attention, facilitating autonomous agents to perform further decision-making. However, the majority of existing 3D vision-language datasets and methods are often limited to specific tasks, limiting their applicability in diverse scenarios. The recent advance of Large Language Models (LLMs) and Multi-modal Language Models (MLMs) has shown mighty capability in solving various language and image tasks. Therefore, it is interesting to unlock MLM’s potential to be an omni 3D assistant for wider tasks. However, current MLMs’ research has been less focused on 3D due to the scarcity of large-scale visual-language datasets. In this work, we introduce M3DBench, a comprehensive multi-modal instruction dataset for complex 3D environments with over 320k instruction-response pairs that: 1) supports general interleaved multi-modal instructions with text, user clicks, images, and other visual prompts, 2) unifies diverse regionand scene-level 3D tasks, composing various fundamental abilities in real-world 3D environments. Furthermore, we establish a new benchmark for assessing the performance of large models in understanding interleaved multi-modal instructions. With extensive quantitative and qualitative experiments, we show the effectiveness of our dataset and baseline model in understanding complex human-environment interactions and accomplishing general 3D-centric tasks. We will release the data and code to accelerate future research on developing 3D MLMs."
Related Material
[pdf]
[supplementary material]
[DOI]