Merlin: Empowering Multimodal LLMs with Foresight Minds
En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao*
Abstract
"Humans can foresee the future based on present observations, a skill we term as foresight minds. However, this capability remains under-explored within existing MLLMs, hindering their capacity to understand intentions behind subjects. To address this, we integrate the future modeling into MLLMs. By utilizing the trajectory, a highly structured representation, as a learning objective, we aim to equip the model to understand spatiotemporal dynamics. Inspired by the learning paradigm of LLMs, we first propose Foresight Pre-Training (FPT) that jointly learns various tasks centered on trajectories, enabling MLLMs to predict entire trajectories from a given initial observation. Then, we propose Foresight Instruction-Tuning (FIT) that requires MLLMs to reason about potential future events based on predicted trajectories. Aided by FPT and FIT, we build an unified MLLM named Merlin that supports complex future reasoning. Experiments show Merlin’s foresight minds with impressive performance on both future reasoning and visual comprehension tasks. Project page: https: //ahnsun.github.io/merlin."
Related Material
[pdf]
[DOI]