Driving world models have attracted increasing attention for their ability to model complex physical dynamics. However, this modeling capability has yet to be fully unleashed, owing to the limited video diversity of current driving datasets. We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics. Our dataset features video clips covering a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge, laying a stepping stone for future world model development. We further define an action instruction following (AIF) benchmark for world models and demonstrate the superiority of the proposed dataset for generating action-controlled future predictions.
| Dataset | Videos | Type | Camera | Ego Trajectory | Text Description |
|---|---|---|---|---|---|
| DrivingDojo | 18.2k | full dataset | ✓ | ✓ | ✓ |
| DrivingDojo-Action | 7.9k | rich ego actions | ✓ | ✓ | |
| DrivingDojo-Interplay | 6.4k | multi-agent interplay | ✓ | ✓ | |
| DrivingDojo-Open | 3.9k | open-world knowledge | ✓ | ✓ | ✓ |
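To illustrate how the subsets above might be consumed, here is a minimal, hypothetical loading sketch. The directory layout, subset names, and the `trajectory` annotation key are assumptions for illustration only, not the released dataset format.

```python
import json
from pathlib import Path

import cv2  # pip install opencv-python

# Hypothetical root; the released dataset may be organized differently.
DATA_ROOT = Path("DrivingDojo")

def iter_clips(subset: str):
    """Yield (frames, ego_trajectory) pairs for one assumed subset
    directory, e.g. 'action', 'interplay', or 'open'."""
    for meta_path in sorted((DATA_ROOT / subset).glob("*.json")):
        meta = json.loads(meta_path.read_text())
        # Read the paired video clip frame by frame.
        cap = cv2.VideoCapture(str(meta_path.with_suffix(".mp4")))
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame)
        cap.release()
        # meta["trajectory"] is assumed to hold (x, y) ego waypoints.
        yield frames, meta["trajectory"]
```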
To enable the world model to generate an unlimited number of high-fidelity, action-controllable virtual driving environments, we curate a subset called DrivingDojo-Action with a balanced distribution of driving maneuvers. This subset covers a diverse range of longitudinal maneuvers, such as acceleration, deceleration, emergency braking, and stop-and-go driving, as well as lateral maneuvers, including lane changing and lane keeping.
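As a concrete illustration, the sketch below shows one plausible way such maneuver labels could be derived from ego trajectories. The thresholds, coordinate convention, and label set are assumptions, not the curation procedure used for the dataset.

```python
import numpy as np

def classify_maneuver(xy: np.ndarray, dt: float = 0.1,
                      brake_thresh: float = -3.0,
                      lane_width: float = 3.5) -> list[str]:
    """Coarsely label a clip from its ego trajectory.

    xy: (T, 2) array of ego positions, assuming +x is forward and
    +y is leftward; dt is the sampling interval in seconds.
    All thresholds are illustrative placeholders.
    """
    v = np.gradient(xy, dt, axis=0)   # velocity, shape (T, 2)
    a = np.gradient(v, dt, axis=0)    # acceleration, shape (T, 2)
    speed = np.linalg.norm(v, axis=1)

    labels = []
    # Longitudinal maneuvers from the speed/acceleration profile.
    if a[:, 0].min() < brake_thresh:
        labels.append("emergency_braking")
    if speed.min() < 0.5 and speed.max() > 5.0:
        labels.append("stop_and_go")
    # Lateral maneuvers from net sideways displacement.
    if abs(xy[-1, 1] - xy[0, 1]) > 0.5 * lane_width:
        labels.append("lane_change")
    return labels or ["lane_keeping"]
```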
We design the DrivingDojo-Interplay subset to focus on interactions with dynamic agents, a core component of the dataset. Each curated clip contains at least one of the following driving scenarios: cutting in, cutting off, meeting, being blocked, overtaking, and being overtaken.
We place a unique emphasis on video clips rich in open-world knowledge and construct the DrivingDojo-Open subset from them. Open-world driving knowledge is difficult to describe owing to its complexity and variability, yet these scenarios are crucial for safe driving.
We propose the action instruction following (AIF) error to measure the consistency between the generated video and the input action conditions.
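To make the metric concrete, here is a minimal sketch of one plausible instantiation: the mean displacement between the conditioning ego trajectory and the trajectory recovered from the generated video. It assumes the generated video's ego trajectory has already been estimated (e.g., with an off-the-shelf visual odometry pipeline) and aligned to the same coordinate frame; the paper's exact formulation may differ.

```python
import numpy as np

def aif_error(action_traj: np.ndarray, est_traj: np.ndarray) -> float:
    """Mean per-step displacement between the conditioning trajectory
    and the trajectory recovered from the generated video.

    Both inputs are (T, 2) arrays of planar ego positions in the same
    coordinate frame. This is an average-displacement reading of AIF
    error, given as an illustration only.
    """
    assert action_traj.shape == est_traj.shape
    return float(np.linalg.norm(action_traj - est_traj, axis=1).mean())

# Example: a rollout that drifts a constant 0.2 m sideways off-command.
t = np.linspace(0, 10, 50)
commanded = np.stack([t * 2.0, np.zeros_like(t)], axis=1)
estimated = commanded + np.array([0.0, 0.2])
print(aif_error(commanded, estimated))  # ~0.2
```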
We show generation demos from our model trained on the DrivingDojo dataset. The model generates high-resolution videos of complex driving scenarios.
This work was supported in part by the National Key R&D Program of China (No. 2022ZD0116500), the National Natural Science Foundation of China (No. U21B2042, No. 62320106010), and in part by the 2035 Innovation Program of CAS, and the InnoHK program.