ECVA | European Computer Vision Association

MRSP: Learn Multi-Representations of Single Primitive for Compositional Zero-Shot Learning

Dongyao Jiang, Hui Chen, Haodong Jing, Yongqiang Ma, Nanning Zheng* ;

Abstract

"Compositional Zero-Shot Learning (CZSL) aims to classify unseen state-object compositions using seen primitives. Previous methods commonly map an identical primitive from different compositions to the same area within embedding space, aiming to establish primitive representation or assess decoding proficiency. However, relying solely on the intersection area of primitive concepts might overlook nuanced semantics due to conditional variance, thereby limiting the model’s capacity to generalize to unseen compositions. In contrast, our approach constructs primitive representations by considering the union area of primitives. We propose a Multiple Representation of Single Primitive learning framework (termed MRSP) for CZSL, which captures composition-relevant features through a state-object-composition three-branch cross-attention architecture. Specifically, the input image feature cross-attends to multiple state, object, and composition features and the prediction scores are adaptively adjusted by combining the output of each branch. Extensive experiments on three benchmarks in both closed-world and open-world settings showcase the superior effectiveness of MRSP."

Related Material

[pdf] [DOI]