Generalizing to Unseen Domains via Text-guided Augmentation

Daiqing Qi*, Handong Zhao, Aidong Zhang, Sheng Li

Abstract


"To avoid the high cost of collecting visual data from all test domains in the domain adaptation task, recent work takes advantage of the pre-trained large-scale vision language models and augment training data with only text descriptions (e.g.,“a photo/painting/sketch...”) of each test domain. However, in many real-world applications, such text information of test domains is not always available in advance. Moreover, even if we can verbalize all test domains, it is laborious for existing work [?] to train a different augmentation network for each possible unseen domain, which suffers from time-inefficiency. To overcome these challenges, we benefit from the multimodal embedding space of a pre-trained vision-language model and propose to acquire training-free and domain-invariant augmentations with text descriptions of arbitrary crafted unseen domains, which not necessarily match test domains. Beyond achieving state-of-the-art results, compared with existing works that require trainable augmentation networks, our approach is also notably more time-efficient, and exhibits a more solid theoretical support."
