DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model

1New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 2School of Artificial Intelligence, University of Chinese Academy of Sciences, 3Meituan Inc., 4Centre for Artificial Intelligence and Robotics, HKISI, CAS
*Work done during an internship at Meituan. Equal contributions. Corresponding author.

Abstract

Driving world models have gained increasing attention due to their ability to model complex physical dynamics. However, their superb modeling capability is yet to be fully unleashed due to the limited video diversity in current driving datasets. We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics. Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge, laying a stepping stone for future world model development. We further define an action instruction following (AIF) benchmark for world models and demonstrate the superiority of the proposed dataset for generating action-controlled future predictions.

DrivingDojo Composition

Dataset                  Videos   Type
DrivingDojo              18.2k    total
DrivingDojo-Action       7.9k     rich ego actions
DrivingDojo-Interplay    6.4k     multi-agent interplay
DrivingDojo-Open         3.9k     open-world knowledge

Every clip is accompanied by camera video, ego trajectory, and text description annotations.

DrivingDojo-Action

To enable the world model to generate an unlimited number of high-fidelity, action-controllable virtual driving environments, we curate a subset, DrivingDojo-Action, that features a balanced distribution of driving maneuvers. It covers a diverse range of both longitudinal maneuvers, such as acceleration, deceleration, emergency braking, and stop-and-go driving, and lateral maneuvers, including lane changing and lane keeping.
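To make the maneuver taxonomy concrete, here is a minimal sketch of how clips could be coarsely labeled from an ego trajectory. It assumes per-frame speed (m/s) and yaw (rad) samples at 10 Hz; the function name and all thresholds are illustrative, not the dataset's actual curation rules.

```python
# Hypothetical sketch: coarse longitudinal/lateral maneuver labels from an
# ego trajectory. Assumes (speed m/s, yaw rad) samples at 10 Hz; thresholds
# are illustrative only.

def label_maneuvers(speeds, yaws, dt=0.1):
    """Return (longitudinal, lateral) labels for one clip."""
    accel = [(b - a) / dt for a, b in zip(speeds, speeds[1:])]
    yaw_rate = [(b - a) / dt for a, b in zip(yaws, yaws[1:])]

    # Longitudinal label: check the strongest events first.
    if min(accel) < -4.0:                          # hard braking
        longitudinal = "emergency_braking"
    elif min(speeds) < 0.5 and max(speeds) > 3.0:  # comes to rest, then moves
        longitudinal = "stop_and_go"
    elif max(accel) > 1.5:
        longitudinal = "acceleration"
    elif min(accel) < -1.5:
        longitudinal = "deceleration"
    else:
        longitudinal = "cruising"

    # Lateral label: peak heading-change rate separates turns, lane
    # changes, and lane keeping.
    peak_yaw_rate = max(abs(r) for r in yaw_rate)
    if peak_yaw_rate > 0.3:
        lateral = "turning"
    elif peak_yaw_rate > 0.05:
        lateral = "lane_change"
    else:
        lateral = "lane_keeping"

    return longitudinal, lateral
```

For example, a clip that decelerates from 10 m/s to a stop in half a second with no heading change would be labeled `("emergency_braking", "lane_keeping")`.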

Turn Left

Go Straight

Turn Right

Lane Change

Emergency Braking

DrivingDojo-Interplay

We design the DrivingDojo-Interplay subset to focus on interactions with dynamic agents, a core component of the dataset. We curate this subset so that each clip contains at least one of the following driving scenarios: cutting in/off, meeting, being blocked, overtaking, and being overtaken.

DrivingDojo-Open

We place a unique emphasis on video clips rich in open-world knowledge and construct the DrivingDojo-Open subset around them. Open-world driving knowledge is difficult to describe exhaustively due to its complexity and variability, yet these scenarios are crucial for safe driving.

Action Instruction Following (AIF)

We propose action instruction following (AIF) errors to measure how consistently the generated video follows the input action conditions.
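One way to realize such an error is to compare the conditioning ego trajectory against the trajectory recovered from the generated video (e.g., via visual odometry). The sketch below computes the mean per-waypoint displacement between the two; the function name and this exact formulation are assumptions for illustration, not the benchmark's official implementation.

```python
import math

# Hedged sketch of an AIF-style error: mean Euclidean distance between the
# conditioning ego trajectory and the trajectory recovered from the
# generated video. Names and formulation are illustrative.

def aif_error(conditioned_xy, recovered_xy):
    """Mean per-waypoint displacement between two 2-D trajectories."""
    assert len(conditioned_xy) == len(recovered_xy), "trajectories must align"
    dists = [
        math.hypot(cx - rx, cy - ry)
        for (cx, cy), (rx, ry) in zip(conditioned_xy, recovered_xy)
    ]
    return sum(dists) / len(dists)
```

A lower value means the generated video adheres more closely to the commanded actions; a model that ignores its action condition drifts away from the conditioning trajectory and accumulates a large error.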

Generation Demos

We show generation demos from a model trained on the DrivingDojo dataset. The model generates high-resolution videos of complex driving scenarios.

Diverse-action Generation

Crossing

Lane Changing

Diverse-scene Generation

Multi-agent Interplay

Out-of-domain Generation

Open-world Generation

Other Modality Conditioning

Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2022ZD0116500), the National Natural Science Foundation of China (No. U21B2042, No. 62320106010), and in part by the 2035 Innovation Program of CAS, and the InnoHK program.