ECVA | European Computer Vision Association

Video Question Answering with Procedural Programs

Rohan Choudhury*, Koichiro Niinuma, Kris Kitani, Laszlo A Jeni

Abstract

"We propose to answer questions about videos by generating short procedural programs that solve visual subtasks to obtain a final answer. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but cannot effectively or efficiently answer questions about videos due to their image-centric modules and lack of temporal reasoning ability. We address this by providing ProViQ with novel modules intended for video understanding, allowing it to generalize to a wide variety of videos with no additional training. As a result, ProViQ can efficiently find relevant moments in long videos, do causal and temporal reasoning, and summarize videos over long time horizons in order to answer complex questions. This code generation framework additionally enables ProViQ to perform other video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, multiple-choice and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023/."
