SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

Kailin Li*, Jingbo Wang, Lixin Yang, Cewu Lu*, Bo Dai

Abstract


Generating natural human grasps requires consideration of not only object geometry but also semantic information. Relying solely on object shape for grasp generation limits the applicability of prior methods to downstream tasks. This paper presents SemGrasp, a novel semantic-based grasp generation method that generates a static human grasp pose by incorporating semantic information into the grasp representation. We introduce a discrete representation that aligns the grasp space with the semantic space, enabling the generation of grasp postures in accordance with language instructions. A Multimodal Large Language Model (MLLM) is subsequently fine-tuned, integrating object, grasp, and language within a unified semantic space. To facilitate the training of SemGrasp, we compile a large-scale, grasp-text-aligned dataset featuring over 300k detailed captions and 50k diverse grasps. Experimental results demonstrate that SemGrasp efficiently generates natural human grasps in alignment with linguistic intentions. Our code, models, and dataset will be made publicly available.
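The core idea of a "language-aligned discretization" is to map a continuous grasp pose onto a small set of discrete codes that can be treated as extra tokens in an LLM's vocabulary. The sketch below is a minimal, hypothetical illustration of that pattern (nearest-neighbor lookup against a learned codebook); the function names, shapes, and codebook size are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def quantize_grasp(pose: np.ndarray, codebook: np.ndarray) -> list:
    """Map each pose sub-vector to the index of its nearest codebook entry.

    pose:     (num_parts, dim)      e.g. per-finger pose chunks (assumed layout)
    codebook: (codebook_size, dim)  learned code embeddings (assumed)
    """
    # Pairwise L2 distances: (num_parts, codebook_size)
    dists = np.linalg.norm(pose[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1).tolist()

def tokens_to_strings(indices: list) -> list:
    """Render codebook indices as special tokens an LLM vocabulary could adopt."""
    return [f"<grasp_{i}>" for i in indices]

# Toy data standing in for a learned codebook and an encoded grasp pose.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 16))   # 256 hypothetical grasp codes
pose = rng.normal(size=(5, 16))         # 5 pose chunks (e.g. one per finger)

indices = quantize_grasp(pose, codebook)
print(tokens_to_strings(indices))       # e.g. ['<grasp_42>', '<grasp_7>', ...]
```

Once a grasp is expressed as such tokens, fine-tuning an MLLM to emit them conditioned on an object and a language instruction reduces grasp generation to ordinary next-token prediction.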

Related Material


[pdf] [supplementary material] [DOI]