ECVA | European Computer Vision Association

Canonical Shape Projection is All You Need for 3D Few-shot Class Incremental Learning

Ali Cheraghian*, Zeeshan Hayder, Sameeea Ramasinghe, Shafin Rahman, Javad Jafaryahya, Lars Petersson, Mehrtash Harandi ;

Abstract

"In recent years, robust pre-trained foundation models have been successfully used in many downstream tasks. Here, we would like to use such powerful models to address the problem of few-shot class incremental learning (FSCIL) tasks on 3D point cloud objects. Our approach is to reprogram the well-known CLIP-based foundation model (trained on 2D images and text pairs) for this purpose. The CLIP model works by ingesting 2D images, so to leverage it in our context, we project the 3D object point cloud onto 2D image space to create proper depth maps. For this, prior works consider a fixed and non-trainable set of camera poses. In contrast, we propose to train the network to find a projection that best describes the object and is appropriate for extracting 2D image features from the CLIP vision encoder. Directly using the generated depth map is not suitable for the CLIP model, so we apply the model reprogramming paradigm to the depth map to augment the foreground and background to adapt it. This removes the need for modification or fine-tuning of the foundation model. In the setting we have investigated, we have limited access to data from novel classes, resulting in a problem with overfitting. Here, we address this problem via the use of a prompt engineering approach using multiple GPT-generated text descriptions. Our method, C3PR, successfully outperforms existing FSCIL methods on ModelNet, ShapeNet, ScanObjectNN, and CO3D datasets. The code is available at https://github.com/alichr/C3PR."

Related Material

[pdf] [DOI]