ECVA | European Computer Vision Association

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi*, Kaisheng Ma*

Abstract

"This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built upon an improved 3D encoder, obtained by extending ReCon to ReCon++, which benefits from multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our new human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding."
