Two-Stage Active Learning for Efficient Temporal Action Segmentation

Yuhao Su, Ehsan Elhamifar

Abstract


"Training a temporal action segmentation (TAS) model on long and untrimmed videos requires gathering framewise video annotations, which is very costly. We propose a two-stage active learning framework to efficiently learn a TAS model using only a small amount of video annotations. Our framework consists of three components that work together in each active learning iteration. 1) Using current labeled frames, we learn a TAS model and action prototypes using a novel contrastive learning method. Leveraging prototypes not only enhances the model performance, but also increases the computational efficiency of both video and frame selection for labeling, which are the next components of our framework. 2) Using the currently learned TAS model and action prototypes, we select informative unlabeled videos for annotation. To do so, we find unlabeled videos that have low alignment scores to learned action prototype sequences in labeled videos. 3) To annotate a small subset of informative frames in each selected unlabeled video, we propose a video-aligned summary selection method and an efficient greedy search algorithm. By evaluation on four benchmark datasets (50Salads, GTEA, Breakfast, CrossTask), we show that our method significantly reduces the annotation costs, while consistently surpassing baselines over active learning iterations. Our method achieves comparable or better performance than other weakly supervised methods while using a small amount of labeled frames. We further extend our framework to a semi-supervised active learning setting. To the best of our knowledge, this is the first work studying active learning for TAS."
