OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection

Changsheng Lu*, Zheyuan Liu, Piotr Koniusz*

Abstract


"Exploiting foundation models (, CLIP) to build a versatile keypoint detector has gained increasing attention. Most existing models accept either the text prompt (, “the nose of a cat”), or the visual prompt (, support image with keypoint annotations), to detect the corresponding keypoints in query image, thereby, exhibiting either zero-shot or few-shot detection ability. However, the research on multimodal prompting is still underexplored, and the prompt diversity in semantics and language is far from opened. For example, how to handle unseen text prompts for novel keypoint detection and the diverse text prompts like “Can you detect the nose and ears of a cat?” In this work, we open the prompt diversity in three aspects: modality, semantics (seen unseen), and language, to enable a more general zero- and few-shot keypoint detection (Z-FSKD). We propose a novel OpenKD model which leverages a multimodal prototype set to support both visual and textual prompting. Further, to infer the keypoint location of unseen texts, we add the auxiliary keypoints and texts interpolated in visual and textual domains into training, which improves the spatial reasoning of our model and significantly enhances zero-shot novel keypoint detection. We also find large language model (LLM) is a good parser, which achieves over 96% accuracy when parsing keypoints from texts. With LLM, OpenKD can handle diverse text prompts. Experimental results show that our method achieves state-of-the-art performance on Z-FSKD and initiates new ways of dealing with unseen text and diverse texts. The source code and data are available at https://github.com/AlanLuSun/OpenKD. Recently the (visual) prompt based keypoint detection has attracted the research interest in community as it provides more general detection paradigm compared to traditional close-set keypoint detection. Existing prompts include the textual prompt, visual prompt, or the both. However, the diversity of text prompt is quite limited in semantics and only using stereotype templates, which severely hinders the real-world application. In this work, we are further opening the prompt diversity, ranging from the easy text prompt to hard and unseen text prompt, pushing towards a more general keypoint detection by transferring the knowledge of large-scale pre-trained models such as vision-language model CLIP and large-language model Vicuna/Llama 2. Specifically, we propose a novel OpenKD model which consists of lexical/text parsing module, dual-modal prompting mechanism, dual-contrastive loss for signal alignment and knowledge transfer, and others. To test the model efficacy, we construct diverse text prompt sets for existing keypoint detection datasets. Our model not only supports both visual or textual modality prompting, but also has the capability to infer keypoint locations of unseen text prompts, realizing the first zero-shot novel keypoint detection. The experiments highlight the effectiveness of the proposed approach."

Related Material


[pdf] [supplementary material] [DOI]