Interaction-centric Spatio-Temporal Context Reasoning for Multi-Person Video HOI Recognition
Yisong Wang, Nan Xi*, Jingjing Meng, Junsong Yuan
Abstract
Understanding human-object interaction (HOI) in videos is a fundamental yet intricate challenge in computer vision, requiring perception and reasoning across both the spatial and temporal domains. Despite the success of prior work in object detection and tracking, multi-person video HOI recognition still faces two major challenges: (1) the three facets of HOI (humans, objects, and the interactions that bind them) are interconnected and mutually influence one another; (2) the combinatorial complexity of multi-person, multi-object spatio-temporal interactions. To address these challenges, we design a spatio-temporal context fuser to better model the interactions among persons and objects in videos. Furthermore, to equip the model with temporal reasoning capacity, we propose an interaction state reasoner module on top of the context fuser. Since the interaction is the key element binding human and object, we propose an interaction-centric hypersphere in the feature embedding space to model each interaction category; it learns the distribution of HOI samples sharing the same interaction on the hypersphere. After training, each interaction prototype sphere is fitted to a test HOI sample to determine its HOI classification. Empirical results on the multi-person video HOI dataset MPHOI-72 show that our method surpasses the state-of-the-art (SOTA) method by more than 22% in F1 score, while on the single-person datasets Bimanual Actions (single-human, two-hand HOI) and CAD-120 (single-human HOI) it achieves results on par with or better than SOTA methods. Source code is released at https://github.com/southnx/IcH-Vid-HOI.
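The interaction-centric classification step can be illustrated with a minimal sketch: each interaction category is represented by a prototype hypersphere (a center and a radius in the embedding space), and a test HOI embedding is assigned to the category whose sphere it fits best. The scoring rule below (deviation of the sample's distance-to-center from the sphere radius) and all names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def classify_hoi(embedding, centers, radii):
    """Assign an HOI embedding to the interaction prototype whose
    hypersphere fits it best.

    embedding : (d,) feature vector of the test HOI sample
    centers   : (k, d) prototype sphere centers, one per interaction class
    radii     : (k,) prototype sphere radii

    Scoring here is an assumption for illustration: how far the sample
    lies from each sphere's surface; the class with the smallest
    deviation wins.
    """
    dists = np.linalg.norm(centers - embedding, axis=1)  # distance to each center
    fit = np.abs(dists - radii)                          # deviation from each surface
    return int(np.argmin(fit))
```

For example, a sample lying close to the surface of one prototype sphere but far from the others is assigned to that interaction class.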