Scenes-Objects-Actions: A Multi-Task, Multi-Label Video Dataset

Jamie Ray, Heng Wang, Du Tran, Yufei Wang, Matt Feiszli, Lorenzo Torresani, Manohar Paluri; The European Conference on Computer Vision (ECCV), 2018, pp. 635-651

Abstract


This paper introduces a large-scale, multi-label and multi-task video dataset named Scenes-Objects-Actions (SOA). Most prior video datasets are based on a predefined taxonomy, which is used to define the keyword queries issued to search engines. The videos retrieved by the search engines are then verified for correctness by human annotators. Datasets collected in this manner tend to yield high classification accuracy, as search engines typically rank "easy" videos first. The SOA dataset adopts a different approach. We rely on uniform sampling to get a better representation of videos on the Web. Trained annotators are asked to provide free-form text labels describing each video in three different aspects: scene, object and action. These raw labels are then merged, split and renamed to generate a taxonomy for SOA. All the annotations are verified again based on the taxonomy. The final dataset includes 562K videos with 3.64M annotations spanning 49 categories for scenes, 356 for objects, and 148 for actions. We show that datasets collected in this way are quite challenging by evaluating existing popular video models on SOA. We provide in-depth analysis of the performance of different models on SOA, and highlight potential new directions in video classification. A key feature of SOA is that it enables the empirical study of correlation among scene, object and action recognition in video. We present results of this study and further analyze the potential of using the information learned from one task to improve the others. We compare SOA with existing datasets in the context of transfer learning and demonstrate that pre-training on SOA consistently improves the accuracy on a wide variety of datasets. We believe that the challenges presented by SOA offer the opportunity for further advancement in video analysis as we progress from single-label classification towards a more comprehensive understanding of video data.
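To make the multi-task, multi-label structure concrete, here is a minimal illustrative sketch (not the authors' code or data format): each video carries an independent set of labels for each of the three tasks, and annotation counts can be tallied per task. The taxonomy sizes below match the abstract (49 scenes, 356 objects, 148 actions); the video IDs and label names are hypothetical.

```python
# Hypothetical SOA-style records: one label set per task, per video.
from collections import Counter

TAXONOMY_SIZES = {"scene": 49, "object": 356, "action": 148}  # from the abstract

annotations = {
    "video_0001": {"scene": {"kitchen"},
                   "object": {"pan", "person"},
                   "action": {"cooking"}},
    "video_0002": {"scene": {"beach"},
                   "object": {"dog", "person"},
                   "action": {"running", "throwing"}},
}

def labels_per_task(videos):
    """Total number of labels contributed by each task across all videos."""
    counts = Counter()
    for tasks in videos.values():
        for task, labels in tasks.items():
            counts[task] += len(labels)
    return dict(counts)

print(labels_per_task(annotations))  # -> {'scene': 2, 'object': 4, 'action': 3}
```

Because labels are sets per task rather than a single class, evaluation on such data is naturally per-task and multi-label (e.g., average precision per category), rather than single-label top-1 accuracy.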

Related Material


[pdf]
[bibtex]
@InProceedings{Ray_2018_ECCV,
author = {Ray, Jamie and Wang, Heng and Tran, Du and Wang, Yufei and Feiszli, Matt and Torresani, Lorenzo and Paluri, Manohar},
title = {Scenes-Objects-Actions: A Multi-Task, Multi-Label Video Dataset},
booktitle = {The European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}
}