DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model

1New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 2School of Artificial Intelligence, University of Chinese Academy of Sciences, 3Meituan Inc., 4Centre for Artificial Intelligence and Robotics, HKISI, CAS
*Work done during an internship at Meituan. Equal contributions. Corresponding author.

Abstract

Driving world models have gained increasing attention due to their ability to model complex physical dynamics. However, their superb modeling capability is yet to be fully unleashed due to the limited video diversity in current driving datasets. We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics. Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge, laying a stepping stone for future world model development. We further define an action instruction following (AIF) benchmark for world models and demonstrate the superiority of the proposed dataset for generating action-controlled future predictions.

DrivingDojo Composition

Dataset                  Videos   Type
DrivingDojo              18.2k    total
DrivingDojo-Action       7.9k     rich ego actions
DrivingDojo-Interplay    6.4k     multi-agent interplay
DrivingDojo-Open         3.9k     open-world knowledge

Every clip is accompanied by camera video, ego trajectory, and text description annotations.

DrivingDojo-Action

To enable the world model to generate an unlimited number of high-fidelity, action-controllable virtual driving environments, we curate a subset, DrivingDojo-Action, that features a balanced distribution of driving maneuvers. It covers a diverse range of both longitudinal maneuvers, such as acceleration, deceleration, emergency braking, and stop-and-go driving, and lateral maneuvers, including lane changing and lane keeping.
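To make the maneuver taxonomy concrete, here is a minimal sketch of how clips could be coarsely labeled from an ego trajectory. It assumes per-frame speed (m/s) and yaw (rad) samples at 10 Hz; the function name and all thresholds are illustrative, not the dataset's actual curation rules.

```python
# Hypothetical sketch: coarse longitudinal/lateral maneuver labels from an
# ego trajectory. Assumes (speed m/s, yaw rad) samples at 10 Hz; thresholds
# are illustrative only.

def label_maneuvers(speeds, yaws, dt=0.1):
    """Return (longitudinal, lateral) labels for one clip."""
    accel = [(b - a) / dt for a, b in zip(speeds, speeds[1:])]
    yaw_rate = [(b - a) / dt for a, b in zip(yaws, yaws[1:])]

    # Longitudinal label: check the strongest events first.
    if min(accel) < -4.0:                          # hard braking
        longitudinal = "emergency_braking"
    elif min(speeds) < 0.5 and max(speeds) > 3.0:  # comes to rest, then moves
        longitudinal = "stop_and_go"
    elif max(accel) > 1.5:
        longitudinal = "acceleration"
    elif min(accel) < -1.5:
        longitudinal = "deceleration"
    else:
        longitudinal = "cruising"

    # Lateral label: peak heading-change rate separates turns, lane
    # changes, and lane keeping.
    peak_yaw_rate = max(abs(r) for r in yaw_rate)
    if peak_yaw_rate > 0.3:
        lateral = "turning"
    elif peak_yaw_rate > 0.05:
        lateral = "lane_change"
    else:
        lateral = "lane_keeping"

    return longitudinal, lateral
```

For example, a clip that decelerates from 10 m/s to a stop in half a second with no heading change would be labeled `("emergency_braking", "lane_keeping")`.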

Turn Left

Go Straight

Turn Right

Lane Change

Emergency Braking

DrivingDojo-Interplay

We design the DrivingDojo-Interplay subset to focus on interactions with dynamic agents, a core component of the dataset. We curate this subset so that each clip contains at least one of the following driving scenarios: cutting in/off, meeting, being blocked, overtaking, and being overtaken.

DrivingDojo-Open

We place a unique emphasis on video clips rich in open-world knowledge and construct the DrivingDojo-Open subset around them. Open-world driving knowledge is difficult to describe exhaustively due to its complexity and variability, yet these scenarios are crucial for safe driving.

Action Instruction Following (AIF)

We propose action instruction following (AIF) errors to measure how consistently the generated video follows the input action conditions.
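One way to realize such an error is to compare the conditioning ego trajectory against the trajectory recovered from the generated video (e.g., via visual odometry). The sketch below computes the mean per-waypoint displacement between the two; the function name and this exact formulation are assumptions for illustration, not the benchmark's official implementation.

```python
import math

# Hedged sketch of an AIF-style error: mean Euclidean distance between the
# conditioning ego trajectory and the trajectory recovered from the
# generated video. Names and formulation are illustrative.

def aif_error(conditioned_xy, recovered_xy):
    """Mean per-waypoint displacement between two 2-D trajectories."""
    assert len(conditioned_xy) == len(recovered_xy), "trajectories must align"
    dists = [
        math.hypot(cx - rx, cy - ry)
        for (cx, cy), (rx, ry) in zip(conditioned_xy, recovered_xy)
    ]
    return sum(dists) / len(dists)
```

A lower value means the generated video adheres more closely to the commanded actions; a model that ignores its action condition drifts away from the conditioning trajectory and accumulates a large error.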

Generation Demos

We show generation demos from a model trained on the DrivingDojo dataset. The model generates high-resolution videos of complex driving scenarios.

Diverse-action Generation

Crossing

Lane Changing

Diverse-scene Generation

Multi-agent Interplay

Out-of-domain Generation

Open-world Generation

Other Modality Conditioning

Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2022ZD0116500), the National Natural Science Foundation of China (No. U21B2042, No. 62320106010), and in part by the 2035 Innovation Program of CAS, and the InnoHK program.