ECVA | European Computer Vision Association

Dolphins: Multimodal Language Model for Driving

Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, Chaowei Xiao*

Abstract

"The quest continues for fully autonomous vehicles (AVs) capable of navigating complex real-world scenarios with human-like understanding and responsiveness. In this paper, we introduce Dolphins, a novel vision-language model designed to acquire human-like abilities as a conversational driving assistant. Dolphins processes multimodal inputs comprising video (or image) data, text instructions, and historical control signals, and generates informed outputs corresponding to the provided instructions. Building upon the open-source pretrained vision-language model OpenFlamingo, we first enhance Dolphins' reasoning capabilities through an innovative Grounded Chain of Thought (GCoT) process in the general domain. We then tailor Dolphins to the driving domain by constructing driving-specific instruction data and conducting instruction tuning. Using the BDD-X dataset, we design and consolidate four distinct AV tasks into Dolphins to foster a holistic understanding of intricate driving scenarios. As a result, the distinctive features of Dolphins fall along two dimensions: (1) the ability to provide a comprehensive understanding of complex and long-tailed open-world driving scenarios and solve a spectrum of AV tasks, and (2) the emergence of human-like capabilities, including gradient-free instant adaptation via in-context learning and error recovery via reflection. The anonymous demo is available at https://vlm-driver.github.io/."

Related Material

[pdf] [supplementary material] [DOI]