Learning Representations from Foundation Models for Domain Generalized Stereo Matching
Yongjian Zhang, Longguang Wang, Kunhong Li, Yun Wang, Yulan Guo*
Abstract
"State-of-the-art stereo matching networks trained on in-domain data often underperform on cross-domain scenes. Intuitively, leveraging the zero-shot capacity of a foundation model can alleviate the cross-domain generalization problem. The main challenge of incorporating a foundation model into stereo matching pipeline lies in the absence of an effective forward process from single-view coarse-grained tokens to cross-view fine-grained cost representations. In this paper, we propose FormerStereo, a general framework that integrates the Vision Transformer (ViT) based foundation model into the stereo matching pipeline. Using this framework, we transfer the all-purpose features to matching-specific ones. Specifically, we propose a reconstruction-constrained decoder to retrieve fine-grained representations from coarse-grained ViT tokens. To maintain cross-view consistent representations, we propose a cosine-constrained concatenation cost (C4) space to construct cost volumes. We integrate FormerStereo with state-of-the-art (SOTA) stereo matching networks and evaluate its effectiveness on multiple benchmark datasets. Experiments show that the FormerStereo framework effectively improves the zero-shot performance of existing stereo matching networks on unseen domains and achieves SOTA performance."