Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding
Minh Tran*, Yelin Kim, Che-Chun Su, Min Sun, Cheng-Hao Kuo, Mohammad Soleymani
;
Abstract
"Self-supervised learning methods have demonstrated impressive performance across visual understanding tasks, including human behavior understanding. However, there has been limited work for self-supervised learning for egocentric social videos. Visual processing in such contexts faces several challenges, including noisy input, limited availability of egocentric social data, and the absence of pretrained models tailored to egocentric contexts. We propose , a novel framework leveraging novel-view face synthesis for dynamic perspective data augmentation from abundant exocentric videos and enhance self-supervised learning process for VideoMAE via: 1) reconstructing exocentric videos from masked dynamic perspective videos; and 2) predicting feature representations of a teacher model based on the corresponding exocentric frames. Experimental results demonstrate that consistently excels across diverse social role understanding tasks. It achieves state-of-the-art results in Ego4D’s Talk-to-me challenge (+0.7% mAP, +3.2% Accuracy). For the Look-at-me challenge, it achieves competitive performance with the state-of-the-art (-0.7% mAP, +1.5% Accuracy) without supervised training on external data. On the EasyCom dataset, our method surpasses both supervised Active Speaker Detection approaches and state-of-the-art video encoders (+1.2% mAP, +1.9% Accuracy compared to MARLIN)."
Related Material
[pdf]
[supplementary material]
[DOI]