AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond

UC Berkeley, Stanford, UCL, Virginia Tech, Nvidia

Abstract

Rapid advances in multimodal models demand benchmarks that rigorously evaluate understanding and reasoning in safety-critical, dynamic real-world settings. We present AccidentBench, a large-scale benchmark that combines vehicle accident scenarios with "Beyond" domains, namely safety-critical settings in air and water that emphasize spatial and temporal reasoning (e.g., navigation, orientation, multi-vehicle motion). The benchmark contains approximately 2,000 videos and over 19,000 human-annotated question-answer pairs spanning multiple video lengths (short/medium/long) and difficulty levels (easy/medium/hard). Tasks systematically probe three core capabilities: temporal, spatial, and intent understanding and reasoning. By unifying accident-centric traffic scenes with broader safety-critical scenarios in air and water, AccidentBench offers a comprehensive, physically grounded testbed for evaluating models under real-world variability. Evaluations of state-of-the-art models (e.g., Gemini-2.5 Pro and GPT-5) show that even the strongest models achieve only about 18% accuracy on the hardest tasks and longest videos, revealing substantial gaps in real-world temporal, spatial, and intent reasoning. AccidentBench is designed to expose these gaps and drive the development of multimodal models that are safer, more robust, and better aligned with real-world safety-critical challenges.
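
The structure described in the abstract (video lengths, difficulty levels, and reasoning types) suggests a per-item record with accuracy aggregated per group. The sketch below is illustrative only: the `QAItem` fields, the `accuracy_by_group` helper, and the case-insensitive exact-match scoring are assumptions, not the benchmark's released schema or evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """One hypothetical question-answer record; field names are illustrative."""
    video_id: str
    domain: str        # e.g. "vehicle", "airplane", or "ship"
    length: str        # "short", "medium", or "long"
    difficulty: str    # "easy", "medium", or "hard"
    reasoning: str     # "temporal", "spatial", or "intent"
    answer: str        # ground-truth answer


def accuracy_by_group(items, predictions,
                      keys=("difficulty", "length", "reasoning")):
    """Return percentage accuracy for each combination of the given fields."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        group = tuple(getattr(item, k) for k in keys)
        total[group] += 1
        # Assumed scoring rule: case-insensitive exact match.
        correct[group] += int(pred.strip().lower() == item.answer.strip().lower())
    return {g: 100.0 * correct[g] / total[g] for g in total}


# Toy data, not taken from the benchmark.
items = [
    QAItem("v1", "vehicle", "long", "hard", "temporal", "B"),
    QAItem("v1", "vehicle", "long", "hard", "temporal", "C"),
    QAItem("v2", "vehicle", "short", "easy", "spatial", "A"),
]
predictions = ["B", "A", "a"]
scores = accuracy_by_group(items, predictions)
```

Grouping by a tuple of fields keeps the helper generic: passing `keys=("difficulty",)` would give one score per difficulty level instead of one per cell.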

Accident Vehicle Space Scenarios

Leaderboard: Benchmark Evaluation on Vehicle Accident Scenarios

Evaluation on vehicle accident scenarios using short, medium, and long videos, categorized by reasoning type: temporal, spatial, and intent. The background color transitions from light blue to light purple, reflecting an increase in video length and indicating a gradual rise in task difficulty.
Difficulty Models Size Overall Avg. Short Video Scenarios Medium Video Scenarios Long Video Scenarios
Avg. Temporal Spatial Intent Avg. Temporal Spatial Intent Avg. Temporal Spatial Intent
Hard GPT 5 🥇 - 37.33 45.87 48.52 55.10 34.00 48.12 49.02 39.29 56.06 18.00 10.00 34.00 10.00
GPT 4o - 24.41 26.78 34.65 34.69 11 35.70 43.14 32.14 31.82 11.00 6 26 1
Gemini 2.5 Pro - 29.76 34.84 36.63 44.90 23.0 35.76 45.10 30.36 31.82 18.67 10.0 28.0 18.0
Gemini 2.5 flash think - 28.67 32.13 35.64 37.75 23.00 35.20 37.25 41.07 27.27 18.67 6.00 36.00 14.00
Gemini 2.5 flash no-think - 24.34 24.74 30.69 26.53 17.00 30.94 52.94 23.21 16.67 17.33 14.00 24.00 14.00
Gemini 1.5 Pro - 18.76 19.72 23.76 20.41 15 24.55 33.33 16.07 24.24 12.00 2268
Claude 3.5 - 28.71 33.76 35.64 31.63 34.0 28.87 37.26 35.71 13.63 16.0 12.0 26.0 10.0
InternVL2.5 26B 23.78 21.33 26.0 31.0 7.0 32.00 46.0 32.0 18.0 18.00 16.0 24.0 14.0
InternVL2.5 8B 22.67 20.00 18.0 33.0 9.0 30.00 46.0 30.0 14.0 18.00 16.0 28.0 10.0
InternVL2.5 4B 19.56 18.67 18.0 28.0 8.0 28.00 34.0 24.0 26.0 12.00 8.0 22.0 6.0
LLaVA Next 32B 16.22 20.67 16.0 32.0 14.0 11.33 12.0 12.0 10.0 16.67 10.0 30.0 10.0
LLaVA Video 7B 19.78 19.33 12.0 35.0 11.0 24.67 26.0 30.0 18.0 15.33 10.0 28.0 8.0
LLaVA OneVision 7B 13.67 14.33 5.0 27.0 11.0 14.67 18.0 8.0 18.0 12.0 6.0 22.0 8.0
Qwen2.5 VL 32B 22.66 19.33 11.0 34.0 13.0 35.33 46.0 24.0 36.0 13.33 4.0 26.0 10.0
Qwen2.5 VL 7B 22.89 26.00 17.0 30.0 31.0 30.00 40.0 32.0 18.0 12.67 2.0 30.0 6.0
Medium GPT 5 🥇 - 48.34 62.55 64.65 67.00 56.00 46.48 50.00 42.22 47.22 36.00 24.00 56.00 28.00
GPT 4o - 36.99 45.49 48.48 55 33 33.89 41.67 26.67 33.33 31.33 24 44 26
Gemini 2.5 Pro - 36.46 42.79 38.38 59.0 31.0 33.93 39.58 28.89 33.33 32.67 28.0 44.0 26.0
Gemini 2.5 flash think - 37.52 47.82 46.47 56.00 41.00 36.99 43.75 42.22 25.00 28.00 12.00 44.00 28.00
Gemini 2.5 flash no-think - 36.70 47.50 48.49 58.00 36.00 33.93 39.58 28.89 33.33 28.67 24.00 42.00 20.00
Gemini 1.5 Pro - 33.89 39.47 42.42 42 34 33.52 33.33 42.22 25 28.67 12 52 22
Claude 3.5 - 35.35 41.78 35.35 50.0 40.0 35.60 39.58 42.22 25.0 28.67 16.0 44.0 26.0
InternVL2.5 26B 35.11 36.00 39.0 50.0 19.0 36.67 50.0 36.0 24.0 32.67 30.0 40.0 28.0
InternVL2.5 8B 34.66 37.33 43.0 57.0 12.0 35.33 42.0 46.0 18.0 31.33 26.0 44.0 24.0
InternVL2.5 4B 33.89 39.67 38.0 53.0 28.0 32.67 44.0 28.0 26.0 29.33 16.0 46.0 26.0
LLaVA Next 32B 20.0 27.33 16.0 49.0 17.0 10.67 14.0 10.0 8.0 22.0 16.0 36.0 14.0
LLaVA Video 7B 25.67 25.00 20.0 34.0 26.0 28.67 36.0 28.0 22.0 23.33 14.0 40.0 16.0
LLaVA OneVision 7B 16.67 16.00 26.0 30.0 16.0 14.67 18.0 8.0 18.0 19.33 12.0 30.0 16.0
Qwen2.5 VL 32B 28.55 28.33 21.0 44.0 20.0 33.33 40.0 30.0 30.0 24.00 8.0 40.0 24.0
Qwen2.5 VL 7B 29.89 39.00 37.0 42.0 38.0 30.67 32.0 40.0 20.0 20.00 16.0 26.0 18.0
Easy GPT 5 🥇 - 54.86 71.20 76.00 69.61 68.00 48.71 47.06 44.90 54.17 44.67 34.00 52.00 48.00
GPT 4o - 42.17 52.35 59 47.06 51 47.16 54.9 44.9 41.67 27.00 44532
Gemini 2.5 Pro - 54.56 62.96 70.0 55.88 63.0 54.73 52.94 59.18 52.08 46.00 40.0 54.0 44.0
Gemini 2.5 flash think - 50.00 67.56 69.00 65.69 68.00 44.45 52.94 40.82 39.58 38.00 32.00 38.00 44.00
Gemini 2.5 flash no-think - 51.40 58.97 70.00 54.90 52.00 46.56 52.94 36.74 50.00 48.67 38.00 56.00 52.00
Gemini 1.5 Pro - 46.00 51.33 60 50 44 36.92 49.02 36.73 25 50.00 58 44 48
Claude 3.5 - 48.59 60.33 61.0 50.0 70.0 36.35 35.29 51.02 22.73 49.33 64.0 44.0 40.0
InternVL2.5 26B 52.55 61.00 62.0 59.0 62.0 45.33 58.0 44.0 34.0 51.33 62.0 62.0 30.0
InternVL2.5 8B 50.11 55.67 55.0 60.0 52.0 44.67 58.0 42.0 34.0 50.00 54.0 64.0 32.0
InternVL2.5 4B 44.89 53.33 46.0 60.0 54.0 37.33 48.0 38.0 26.0 44.00 44.0 48.0 40.0
LLaVA Next 32B 31.25 38.00 35.0 45.0 34.0 21.33 12.0 14.0 38.0 34.67 20.0 50.0 34.0
LLaVA Video 7B 31.44 33.00 30.0 31.0 38.0 33.33 38.0 36.0 26.0 28.00 16.0 32.0 36.0
LLaVA OneVision 7B 29.78 32.00 31.0 33.0 32.0 24.00 26.0 30.0 16.0 33.33 28.0 36.0 36.0
Qwen2.5 VL 32B 43.22 51.00 58.0 50.0 45.0 41.33 46.0 38.0 40.0 37.33 32.0 44.0 36.0
Qwen2.5 VL 7B 40.67 51.33 55.0 42.0 57.0 36.00 32.0 42.0 34.0 34.67 34.0 28.0 42.0
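
The averages in the vehicle leaderboard appear to be unweighted means: each per-length Avg. matches the mean of its Temporal, Spatial, and Intent scores, and the overall average matches the mean of the three per-length averages. This is an observation from the reported numbers rather than a documented rule, and some rows deviate from it slightly. A minimal Python sketch of the check, using the GPT 5 Hard row as input (the helper names and the 0.01 tolerance are assumptions chosen for illustration):

```python
def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# GPT 5, Hard row of the vehicle table:
# video length -> (reported Avg., [Temporal, Spatial, Intent])
gpt5_hard = {
    "short": (45.87, [48.52, 55.10, 34.00]),
    "medium": (48.12, [49.02, 39.29, 56.06]),
    "long": (18.00, [10.00, 34.00, 10.00]),
}
reported_overall = 37.33


def check_row(row, overall, tol=0.01):
    """Return True if every reported average matches the unweighted mean within tol."""
    per_length_ok = all(
        abs(mean(parts) - avg) <= tol for avg, parts in row.values()
    )
    overall_ok = abs(mean([avg for avg, _ in row.values()]) - overall) <= tol
    return per_length_ok and overall_ok


row_is_consistent = check_row(gpt5_hard, reported_overall)
```

For this row the check passes. Running the same check over other rows is a quick way to spot transcription errors in a copied leaderboard.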

Airplane Navigation Scenarios

Leaderboard: Benchmark Evaluation on Airplane Navigation Scenarios

Evaluation on airplane navigation scenarios using short, medium, and long videos, categorized by reasoning type: temporal, spatial, and intent. The background color transitions from light blue to light purple, reflecting an increase in video length and indicating a gradual rise in task difficulty.
Difficulty Models Size Overall Avg. Short Video Scenarios Medium Video Scenarios Long Video Scenarios
Avg. Temporal Spatial Intent Avg. Temporal Spatial Intent Avg. Temporal Spatial Intent
Hard GPT 5 - 28.11 26.67 18.00 30.00 32.00 26.00 32.00 34.00 12.00 31.67 30.00 20.00 45.00
GPT 4o - 18.11 21.33 16.00 26.00 22.00 14.67 12.00 30.00 2.00 18.33 5.00 35.00 15.00
Gemini 2.5 Pro 🥇 - 31.39 32.83 36.0 24.49 38.0 24.67 32.0 22.0 20.0 36.67 30.0 15.0 65.0
Gemini 2.5 flash think - 25.78 26.00 26.00 18.00 34.00 21.33 28.00 18.00 18.00 30.00 30.00 10.00 50.00
Gemini 2.5 flash no-think - 25.44 25.33 22.00 28.00 26.00 26.00 26.00 28.00 24.00 25.00 0.00 40.00 35.00
Gemini 1.5 Pro - 22.34 26.67 24.00 26.00 30.00 18.67 20.00 22.00 14.00 21.67 10.00 25.00 30.00
Claude 3.5 - 24.22 26.00 18.0 32.0 28.0 23.33 20.0 28.0 22.0 23.33 10.0 40.0 20.0
InternVL2.5 26B 17.33 19.33 24.00 26.00 10.00 19.33 16.00 32.00 10.00 13.33 10.00 10.00 20.00
InternVL2.5 8B 18.22 18.67 20.00 28.00 8.00 19.33 16.00 30.00 12.00 16.67 5.00 35.00 10.00
InternVL2.5 4B 15.33 15.33 14.00 10.00 22.00 14.00 16.00 18.00 8.00 16.67 15.00 30.00 5.00
LLaVA Next 32B 17.89 18.67 14.0 34.0 8.0 16.67 6.0 32.0 12.0 18.33 5.0 40.0 10.0
LLaVA Video 7B 14.78 16.67 14.00 28.00 8.00 12.67 6.00 22.00 10.00 15.00 5.00 30.00 10.00
LLaVA OneVision 7B 15.67 16.00 12.00 28.00 8.00 16.00 12.00 26.00 10.00 15.00 10.00 25.00 10.00
Qwen2.5 VL 32B 16.22 20.00 6.00 36.00 18.00 15.33 4.00 24.00 18.00 13.33 0.00 30.00 10.00
Qwen2.5 VL 7B 16.55 19.33 0.00 30.00 28.00 15.33 2.00 30.00 14.00 15.00 5.00 30.00 10.00
Medium GPT 5 - 44.00 39.33 28.00 44.00 46.00 39.33 36.00 54.00 28.00 53.33 65.00 35.00 60.00
GPT 4o - 38.45 38.67 38.00 56.00 22.00 30.00 38.00 34.00 18.00 46.67 65.00 30.00 45.00
Gemini 2.5 Pro - 43.11 44.67 42.0 40.0 52.0 31.33 34.0 34.0 26.0 53.33 60.0 35.0 65.0
Gemini 2.5 flash think - 39.78 39.33 32.00 38.00 48.00 30.00 34.00 28.00 28.00 50.00 65.00 15.00 70.00
Gemini 2.5 flash no-think 🥇 - 49.67 43.33 30.00 48.00 52.00 40.67 38.00 50.00 34.00 65.00 60.00 65.00 70.00
Gemini 1.5 Pro - 38.78 38.00 32.00 48.00 34.00 36.67 34.00 52.00 24.00 41.67 30.00 55.00 40.00
Claude 3.5 - 39.67 38.00 26.0 40.0 48.0 36.00 32.0 54.0 22.0 45.00 50.0 35.0 50.0
InternVL2.5 26B 28.67 31.33 28.00 58.00 8.00 24.67 12.00 50.00 12.00 30.00 25.00 45.00 20.00
InternVL2.5 8B 34.33 30.00 20.00 58.00 12.00 34.67 32.00 50.00 22.00 38.33 40.00 45.00 30.00
InternVL2.5 4B 32.22 29.33 28.00 44.00 16.00 34.00 30.00 54.00 18.00 33.33 35.00 40.00 25.00
LLaVA Next 32B 26.11 24.67 18.0 40.0 16.0 25.33 18.0 40.0 18.0 28.33 25.0 40.0 20.0
LLaVA Video 7B 24.00 25.33 24.00 36.00 16.00 20.00 16.00 26.00 18.00 26.67 15.00 45.00 20.00
LLaVA OneVision 7B 23.67 23.33 20.00 34.00 16.00 22.67 20.00 32.00 16.00 25.00 20.00 35.00 20.00
Qwen2.5 VL 32B 33.34 32.67 12.00 48.00 38.00 30.67 22.00 50.00 20.00 36.67 20.00 60.00 30.00
Qwen2.5 VL 7B 28.00 24.67 16.00 24.00 34.00 26.00 24.00 26.00 28.00 33.33 35.00 20.00 45.00
Easy GPT 5 - 52.00 47.33 42.00 42.00 58.00 48.67 46.00 46.00 54.00 60.00 65.00 35.00 80.00
GPT 4o - 40.67 35.33 30.00 28.00 48.00 36.67 24.00 38.00 48.00 50.00 45.00 50.00 55.00
Gemini 2.5 Pro 🥇 - 52.56 56.00 60.0 48.0 60.0 40.00 40.0 36.0 44.0 61.67 75.0 35.0 75.0
Gemini 2.5 flash think - 50.67 49.33 40.00 46.00 62.00 46.00 46.00 44.00 48.00 56.67 55.00 40.00 75.00
Gemini 2.5 flash no-think - 50.78 49.33 36.00 52.00 60.00 48.00 40.00 50.00 54.00 55.00 60.00 50.00 55.00
Gemini 1.5 Pro - 43.00 45.33 36.00 44.00 56.00 42.00 48.00 32.00 46.00 41.67 35.00 50.00 40.00
Claude 3.5 - 42.45 38.00 34.0 38.0 42.0 42.67 30.0 56.0 42.0 46.67 40.0 45.0 55.0
InternVL2.5 26B 36.11 35.33 36.00 44.00 26.00 34.67 28.00 46.00 30.00 38.33 30.00 40.00 45.00
InternVL2.5 8B 38.44 36.67 28.00 46.00 36.00 35.33 32.00 42.00 32.00 43.33 60.00 40.00 30.00
InternVL2.5 4B 40.33 43.33 42.00 50.00 38.00 39.33 30.00 44.00 44.00 38.33 35.00 60.00 20.00
LLaVA Next 32B 33.22 36.67 36.00 42.0 32.0 31.33 36.0 32.0 26.0 31.67 35.0 30.0 30.0
LLaVA Video 7B 33.22 33.33 34.00 38.00 28.00 34.67 34.00 38.00 32.00 31.67 35.00 30.00 30.00
LLaVA OneVision 7B 33.22 33.33 34.00 38.00 28.00 34.67 34.00 38.00 32.00 31.67 35.00 30.00 30.00
Qwen2.5 VL 32B 52.45 50.00 34.00 56.00 60.00 50.67 40.00 54.00 58.00 56.67 55.00 60.00 55.00
Qwen2.5 VL 7B 39.89 33.33 28.00 18.00 54.00 38.00 48.00 16.00 50.00 48.33 55.00 30.00 60.00

Ship Motion Scenarios

Leaderboard: Benchmark Evaluation on Ship Motion Scenarios

Evaluation on ship motion scenarios using river and ocean videos, categorized by reasoning type: temporal, spatial, and intent. The background color transitions from light blue to light purple, indicating a gradual rise in task difficulty.
Difficulty Models Size Overall Avg. River Scenarios Ocean Scenarios
Avg. Temporal Spatial Intent Avg. Temporal Spatial Intent
Hard GPT 5 🥇 - 38.36 48.72 46.15 30.77 69.23 28.00 30.00 32.00 22.00
GPT 4o - 22.10 28.20 38.46 26.92 19.23 16.00 18.00 18.00 12.00
Gemini 2.5 Pro - 29.64 34.62 23.08 34.62 46.15 24.67 38.0 16.0 20.0
Gemini 2.5 flash think - 27.36 32.05 30.77 26.92 38.46 22.67 30.00 22.00 16.00
Gemini 2.5 flash no-think - 27.44 28.21 42.31 19.23 23.08 26.67 36.00 20.00 24.00
Gemini 1.5 Pro - 26.02 26.92 23.08 30.77 26.92 25.11 34.00 20.93 20.41
Claude 3.5 - 25.44 28.20 19.23 19.23 46.15 22.67 26.0 22.0 20.0
InternVL2.5 26B 22.54 23.08 15.38 19.23 34.62 22.00 18.00 28.00 20.00
InternVL2.5 8B 21.90 21.79 7.69 26.92 30.77 22.00 16.00 28.00 22.00
InternVL2.5 4B 20.92 20.51 19.23 19.23 23.08 21.33 16.00 26.00 22.00
LLaVA Next 32B 14.39 11.54 7.69 19.23 7.69 15.33 8.0 30.0 8.0
LLaVA Video 7B 14.00 16.67 15.38 23.08 11.54 11.33 8.00 20.00 6.00
LLaVA OneVision 7B 15.67 16.67 11.54 26.92 11.54 14.67 8.00 28.00 8.00
Qwen2.5 VL 32B 13.39 14.10 7.69 23.08 11.54 12.67 8.0 24.0 6.0
Qwen2.5 VL 7B 14.67 16.67 7.69 30.77 11.54 12.67 6.00 24.00 8.00
Medium GPT 5 🥇 - 51.80 60.26 53.85 46.15 80.77 43.33 56.00 48.00 26.00
GPT 4o - 38.49 42.31 50.00 53.85 23.08 34.67 36.00 48.00 20.00
Gemini 2.5 Pro - 41.77 44.87 30.77 61.54 42.31 38.67 48.0 46.0 22.0
Gemini 2.5 flash think - 48.26 53.85 61.54 57.70 42.31 42.67 52.00 42.00 34.00
Gemini 2.5 flash no-think - 46.12 50.00 46.15 57.69 46.15 42.00 56.00 44.00 26.00
Gemini 1.5 Pro - 46.31 53.84 46.15 65.38 50.00 38.78 34.00 49.02 33.33
Claude 3.5 - 38.62 35.90 34.62 50.0 23.08 41.33 42.0 54.0 28.0
InternVL2.5 26B 41.77 44.87 30.77 57.69 46.15 38.67 24.00 62.00 30.00
InternVL2.5 8B 41.08 46.15 34.62 61.54 42.31 36.00 34.00 60.00 14.00
InternVL2.5 4B 44.36 48.72 23.08 65.38 57.69 40.00 28.00 60.00 32.00
LLaVA Next 32B 20.88 23.08 11.54 38.46 19.23 18.67 10.00 30.00 16.00
LLaVA Video 7B 21.92 20.51 19.23 26.92 15.38 23.33 20.00 30.00 20.00
LLaVA OneVision 7B 22.54 23.08 19.23 30.77 19.23 22.00 14.00 34.00 18.00
Qwen2.5 VL 32B 33.31 34.62 19.23 50.00 34.62 32.00 20.00 50.00 26.00
Qwen2.5 VL 7B 24.08 29.49 19.23 30.77 38.46 18.67 18.00 26.00 12.00
Easy GPT 5 🥇 - 63.00 66.67 61.54 50.00 88.46 59.33 78.00 48.00 52.00
GPT 4o - 50.51 57.69 57.69 50.00 65.38 43.33 66.00 34.00 30.00
Gemini 2.5 Pro - 61.05 64.10 57.69 57.69 76.92 58.00 72.0 50.0 52.0
Gemini 2.5 flash think - 62.03 65.39 80.77 42.31 73.08 58.67 70.00 52.00 54.00
Gemini 2.5 flash no-think - 58.18 57.69 57.69 38.46 76.92 58.67 80.00 42.00 54.00
Gemini 1.5 Pro - 50.69 52.56 42.31 61.54 53.85 48.81 50.00 46.43 50.00
Claude 3.5 - 49.39 47.44 50.0 53.85 38.46 51.33 62.0 52.0 40.0
InternVL2.5 26B 55.05 64.10 65.38 57.69 69.23 46.00 50.00 50.00 38.00
InternVL2.5 8B 53.47 60.26 69.23 46.15 65.38 46.67 46.00 54.00 40.00
InternVL2.5 4B 53.87 56.41 53.85 57.69 57.69 51.33 52.00 56.00 46.00
LLaVA Next 32B 35.59 37.18 26.92 53.85 30.77 34.00 30.00 38.00 34.00
LLaVA Video 7B 31.03 32.05 30.77 34.62 30.77 30.00 22.00 38.00 30.00
LLaVA OneVision 7B 33.00 33.33 34.62 34.62 30.77 32.67 28.00 38.00 32.00
Qwen2.5 VL 32B 52.77 61.54 53.85 61.54 69.23 44.00 40.00 54.00 38.00
Qwen2.5 VL 7B 31.31 34.62 38.46 19.23 46.15 28.00 36.00 22.00 26.00

Opportunities!

  1. Long-Horizon Temporal Reasoning: Future work can explore models with stronger memory and temporal abstraction capabilities to handle long-duration videos with multiple events and delayed causal effects, especially in air and water scenarios.
  2. Generalization Across Domains and Modalities: Cross-domain generalization (from land to air or water) and transfer learning across modalities (e.g., combining video, text, and audio) remain underexplored and are crucial for building versatile systems.
  3. Safety-Aware and Verifiable Reasoning: Given the high-stakes nature of open-space applications (e.g., autonomous driving or aircraft control), future benchmarks and methods should integrate safety constraints and provide interpretable or verifiable reasoning processes.