AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond

¹UC Berkeley, ²Stanford, ³UCL, ⁴Virginia Tech, ⁵Nvidia

Abstract

Rapid advances in multimodal models demand benchmarks that rigorously evaluate understanding and reasoning in safety-critical, dynamic real-world settings. We present AccidentBench, a large-scale benchmark that combines vehicle accident scenarios with Beyond domains, safety-critical settings in air and water that emphasize spatial and temporal reasoning (e.g., navigation, orientation, multi-vehicle motion). The benchmark contains approximately 2000 videos and over 19000 human-annotated question-answer pairs spanning multiple video lengths (short/medium/long) and difficulty levels (easy/medium/hard). Tasks systematically probe core capabilities: temporal, spatial, and intent understanding and reasoning. By unifying accident-centric traffic scenes with broader safety-critical scenarios in air and water, AccidentBench offers a comprehensive, physically grounded testbed for evaluating models under real-world variability. Evaluations of state-of-the-art models (e.g., Gemini-2.5 Pro and GPT-5) show that even the strongest models achieve only about 18% accuracy on the hardest tasks and longest videos, revealing substantial gaps in real-world temporal, spatial, and intent reasoning. AccidentBench is designed to expose these critical gaps and drive the development of multimodal models that are safer, more robust, and better aligned with real-world safety-critical challenges.
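
The "about 18%" figure can be checked against the Long-video averages in the Hard rows of the accident-vehicle leaderboard on this page. The snippet below is a small sanity check, not the authors' evaluation code; the dictionary name is ours, and the values are copied from that table.

```python
# Long-video "Avg." values for the Hard difficulty level, copied from the
# accident-vehicle leaderboard. Model sizes are appended to open-model names.
hard_long_avg = {
    "GPT 5": 18.00,
    "GPT 4o": 11.00,
    "Gemini 2.5 Pro": 18.67,
    "Gemini 2.5 flash think": 18.67,
    "Gemini 2.5 flash no-think": 17.33,
    "Gemini 1.5 Pro": 12.00,
    "Claude 3.5": 16.0,
    "InternVL2.5 26B": 18.00,
    "InternVL2.5 8B": 18.00,
    "InternVL2.5 4B": 12.00,
    "LLaVA Next 32B": 16.67,
    "LLaVA Video 7B": 15.33,
    "LLaVA OneVision 7B": 12.0,
    "Qwen2.5 VL 32B": 13.33,
    "Qwen2.5 VL 7B": 12.67,
}

# Highest Hard/Long average and the models that reach it.
best_score = max(hard_long_avg.values())
best_models = sorted(m for m, s in hard_long_avg.items() if s == best_score)
print(best_score, best_models)
```

No listed model exceeds 18.67 in this setting, which is consistent with the claim in the abstract.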

Leaderboard of Benchmark Evaluation in the Accident Vehicles

Evaluation of the **Accident Vehicles** using **Short**, **Medium**, and **Long** videos, categorized by reasoning types: temporal, spatial, and intent reasoning. The background color transitions from light blue to light purple, reflecting an increase in video length and indicating a gradual rise in task difficulty.
Difficulty	Models	Size	Over. Avg.	Short Video Scenarios				Medium Video Scenarios				Long Video Scenarios
Difficulty	Models	Size	Over. Avg.	Avg.	Temporal	Spatial	Intent	Avg.	Temporal	Spatial	Intent	Avg.	Temporal	Spatial	Intent
Hard	GPT 5 🥇	-	37.33	45.87	48.52	55.10	34.00	48.12	49.02	39.29	56.06	18.00	10.00	34.00	10.00
	GPT 4o	-	24.41	26.78	34.65	34.69	11	35.70	43.14	32.14	31.82	11.00	6	26	1
	Gemini 2.5 Pro	-	29.76	34.84	36.63	44.90	23.0	35.76	45.10	30.36	31.82	18.67	10.0	28.0	18.0
	Gemini 2.5 flash think	-	28.67	32.13	35.64	37.75	23.00	35.20	37.25	41.07	27.27	18.67	6.00	36.00	14.00
	Gemini 2.5 flash no-think	-	24.34	24.74	30.69	26.53	17.00	30.94	52.94	23.21	16.67	17.33	14.00	24.00	14.00
	Gemini 1.5 Pro	-	18.76	19.72	23.76	20.41	15	24.55	33.33	16.07	24.24	12.00	2	26	8
	Claude 3.5	-	28.71	33.76	35.64	31.63	34.0	28.87	37.26	35.71	13.63	16.0	12.0	26.0	10.0
	InternVL2.5	26B	23.78	21.33	26.0	31.0	7.0	32.00	46.0	32.0	18.0	18.00	16.0	24.0	14.0
	InternVL2.5	8B	22.67	20.00	18.0	33.0	9.0	30.00	46.0	30.0	14.0	18.00	16.0	28.0	10.0
	InternVL2.5	4B	19.56	18.67	18.0	28.0	8.0	28.00	34.0	24.0	26.0	12.00	8.0	22.0	6.0
	LLaVA Next	32B	16.22	20.67	16.0	32.0	14.0	11.33	12.0	12.0	10.0	16.67	10.0	30.0	10.0
	LLaVA Video	7B	19.78	19.33	12.0	35.0	11.0	24.67	26.0	30.0	18.0	15.33	10.0	28.0	8.0
	LLaVA OneVision	7B	13.67	14.33	5.0	27.0	11.0	14.67	18.0	8.0	18.0	12.0	6.0	22.0	8.0
	Qwen2.5 VL	32B	22.66	19.33	11.0	34.0	13.0	35.33	46.0	24.0	36.0	13.33	4.0	26.0	10.0
	Qwen2.5 VL	7B	22.89	26.00	17.0	30.0	31.0	30.00	40.0	32.0	18.0	12.67	2.0	30.0	6.0

Medium	GPT 5 🥇	-	48.34	62.55	64.65	67.00	56.00	46.48	50.00	42.22	47.22	36.00	24.00	56.00	28.00
	GPT 4o	-	36.99	45.49	48.48	55	33	33.89	41.67	26.67	33.33	31.33	24	44	26
	Gemini 2.5 Pro	-	36.46	42.79	38.38	59.0	31.0	33.93	39.58	28.89	33.33	32.67	28.0	44.0	26.0
	Gemini 2.5 flash think	-	37.52	47.82	46.47	56.00	41.00	36.99	43.75	42.22	25.00	28.00	12.00	44.00	28.00
	Gemini 2.5 flash no-think	-	36.70	47.50	48.49	58.00	36.00	33.93	39.58	28.89	33.33	28.67	24.00	42.00	20.00
	Gemini 1.5 Pro	-	33.89	39.47	42.42	42	34	33.52	33.33	42.22	25	28.67	12	52	22
	Claude 3.5	-	35.35	41.78	35.35	50.0	40.0	35.60	39.58	42.22	25.0	28.67	16.0	44.0	26.0
	InternVL2.5	26B	35.11	36.00	39.0	50.0	19.0	36.67	50.0	36.0	24.0	32.67	30.0	40.0	28.0
	InternVL2.5	8B	34.66	37.33	43.0	57.0	12.0	35.33	42.0	46.0	18.0	31.33	26.0	44.0	24.0
	InternVL2.5	4B	33.89	39.67	38.0	53.0	28.0	32.67	44.0	28.0	26.0	29.33	16.0	46.0	26.0
	LLaVA Next	32B	20.0	27.33	16.0	49.0	17.0	10.67	14.0	10.0	8.0	22.0	16.0	36.0	14.0
	LLaVA Video	7B	25.67	25.00	20.0	34.0	26.0	28.67	36.0	28.0	22.0	23.33	14.0	40.0	16.0
	LLaVA OneVision	7B	16.67	16.00	26.0	30.0	16.0	14.67	18.0	8.0	18.0	19.33	12.0	30.0	16.0
	Qwen2.5 VL	32B	28.55	28.33	21.0	44.0	20.0	33.33	40.0	30.0	30.0	24.00	8.0	40.0	24.0
	Qwen2.5 VL	7B	29.89	39.00	37.0	42.0	38.0	30.67	32.0	40.0	20.0	20.00	16.0	26.0	18.0

Easy	GPT 5 🥇	-	54.86	71.20	76.00	69.61	68.00	48.71	47.06	44.90	54.17	44.67	34.00	52.00	48.00
	GPT 4o	-	42.17	52.35	59	47.06	51	47.16	54.9	44.9	41.67	27.00	44	5	32
	Gemini 2.5 Pro	-	54.56	62.96	70.0	55.88	63.0	54.73	52.94	59.18	52.08	46.00	40.0	54.0	44.0
	Gemini 2.5 flash think	-	50.00	67.56	69.00	65.69	68.00	44.45	52.94	40.82	39.58	38.00	32.00	38.00	44.00
	Gemini 2.5 flash no-think	-	51.40	58.97	70.00	54.90	52.00	46.56	52.94	36.74	50.00	48.67	38.00	56.00	52.00
	Gemini 1.5 Pro	-	46.00	51.33	60	50	44	36.92	49.02	36.73	25	50.00	58	44	48
	Claude 3.5	-	48.59	60.33	61.0	50.0	70.0	36.35	35.29	51.02	22.73	49.33	64.0	44.0	40.0
	InternVL2.5	26B	52.55	61.00	62.0	59.0	62.0	45.33	58.0	44.0	34.0	51.33	62.0	62.0	30.0
	InternVL2.5	8B	50.11	55.67	55.0	60.0	52.0	44.67	58.0	42.0	34.0	50.00	54.0	64.0	32.0
	InternVL2.5	4B	44.89	53.33	46.0	60.0	54.0	37.33	48.0	38.0	26.0	44.00	44.0	48.0	40.0
	LLaVA Next	32B	31.25	38.00	35.0	45.0	34.0	21.33	12.0	14.0	38.0	34.67	20.0	50.0	34.0
	LLaVA Video	7B	31.44	33.00	30.0	31.0	38.0	33.33	38.0	36.0	26.0	28.00	16.0	32.0	36.0
	LLaVA OneVision	7B	29.78	32.00	31.0	33.0	32.0	24.00	26.0	30.0	16.0	33.33	28.0	36.0	36.0
	Qwen2.5 VL	32B	43.22	51.00	58.0	50.0	45.0	41.33	46.0	38.0	40.0	37.33	32.0	44.0	36.0
	Qwen2.5 VL	7B	40.67	51.33	55.0	42.0	57.0	36.00	32.0	42.0	34.0	34.67	34.0	28.0	42.0
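
The averaged columns in the leaderboard appear to follow a simple pattern: each per-length "Avg." matches the unweighted mean of the temporal, spatial, and intent scores, and "Over. Avg." matches the mean of the three per-length averages. The sketch below checks this for the GPT 5 Hard row. This is an observed pattern rather than a documented procedure, and the helper and variable names are ours. Not every row reproduces exactly: the GPT 4o Hard row lists 24.41 where this recipe gives 24.49.

```python
def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)

# GPT 5, Hard difficulty: (temporal, spatial, intent) per video length,
# copied from the accident-vehicle leaderboard.
gpt5_hard = {
    "short": (48.52, 55.10, 34.00),
    "medium": (49.02, 39.29, 56.06),
    "long": (10.00, 34.00, 10.00),
}

# Per-length average, rounded to two decimals as in the table.
length_avg = {name: round(mean(scores), 2) for name, scores in gpt5_hard.items()}

# Overall average over the three per-length averages.
overall = round(mean(length_avg.values()), 2)

print(length_avg, overall)
```

The recomputed values agree with the listed 45.87, 48.12, 18.00, and 37.33 for that row.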

Leaderboard of Benchmark Evaluation in the Airplane

Evaluation of the **Airplane Navigation** using **Short**, **Medium**, and **Long** videos, categorized by reasoning types: temporal, spatial, and intent reasoning. The background color transitions from light blue to light purple, reflecting an increase in video length and indicating a gradual rise in task difficulty.
Difficulty	Models	Size	Over. Avg.	Short Video Scenarios				Medium Video Scenarios				Long Video Scenarios
Difficulty	Models	Size	Over. Avg.	Avg.	Temporal	Spatial	Intent	Avg.	Temporal	Spatial	Intent	Avg.	Temporal	Spatial	Intent
Hard	GPT 5	-	28.11	26.67	18.00	30.00	32.00	26.00	32.00	34.00	12.00	31.67	30.00	20.00	45.00
	GPT 4o	-	18.11	21.33	16.00	26.00	22.00	14.67	12.00	30.00	2.00	18.33	5.00	35.00	15.00
	Gemini 2.5 Pro 🥇	-	31.39	32.83	36.0	24.49	38.0	24.67	32.0	22.0	20.0	36.67	30.0	15.0	65.0
	Gemini 2.5 flash think	-	25.78	26.00	26.00	18.00	34.00	21.33	28.00	18.00	18.00	30.00	30.00	10.00	50.00
	Gemini 2.5 flash no-think	-	25.44	25.33	22.00	28.00	26.00	26.00	26.00	28.00	24.00	25.00	0.00	40.00	35.00
	Gemini 1.5 Pro	-	22.34	26.67	24.00	26.00	30.00	18.67	20.00	22.00	14.00	21.67	10.00	25.00	30.00
	Claude 3.5	-	24.22	26.00	18.0	32.0	28.0	23.33	20.0	28.0	22.0	23.33	10.0	40.0	20.0
	InternVL2.5	26B	17.33	19.33	24.00	26.00	10.00	19.33	16.00	32.00	10.00	13.33	10.00	10.00	20.00
	InternVL2.5	8B	18.22	18.67	20.00	28.00	8.00	19.33	16.00	30.00	12.00	16.67	5.00	35.00	10.00
	InternVL2.5	4B	15.33	15.33	14.00	10.00	22.00	14.00	16.00	18.00	8.00	16.67	15.00	30.00	5.00
	LLaVA Next	32B	17.89	18.67	14.0	34.0	8.0	16.67	6.0	32.0	12.0	18.33	5.0	40.0	10.0
	LLaVA Video	7B	14.78	16.67	14.00	28.00	8.00	12.67	6.00	22.00	10.00	15.00	5.00	30.00	10.00
	LLaVA OneVision	7B	15.67	16.00	12.00	28.00	8.00	16.00	12.00	26.00	10.00	15.00	10.00	25.00	10.00
	Qwen2.5 VL	32B	16.22	20.00	6.00	36.00	18.00	15.33	4.00	24.00	18.00	13.33	0.00	30.00	10.00
	Qwen2.5 VL	7B	16.55	19.33	0.00	30.00	28.00	15.33	2.00	30.00	14.00	15.00	5.00	30.00	10.00

Medium	GPT 5	-	44.00	39.33	28.00	44.00	46.00	39.33	36.00	54.00	28.00	53.33	65.00	35.00	60.00
	GPT 4o	-	38.45	38.67	38.00	56.00	22.00	30.00	38.00	34.00	18.00	46.67	65.00	30.00	45.00
	Gemini 2.5 Pro	-	43.11	44.67	42.0	40.0	52.0	31.33	34.0	34.0	26.0	53.33	60.0	35.0	65.0
	Gemini 2.5 flash think	-	39.78	39.33	32.00	38.00	48.00	30.00	34.00	28.00	28.00	50.00	65.00	15.00	70.00
	Gemini 2.5 flash no-think 🥇	-	49.67	43.33	30.00	48.00	52.00	40.67	38.00	50.00	34.00	65.00	60.00	65.00	70.00
	Gemini 1.5 Pro	-	38.78	38.00	32.00	48.00	34.00	36.67	34.00	52.00	24.00	41.67	30.00	55.00	40.00
	Claude 3.5	-	39.67	38.00	26.0	40.0	48.0	36.00	32.0	54.0	22.0	45.00	50.0	35.0	50.0
	InternVL2.5	26B	28.67	31.33	28.00	58.00	8.00	24.67	12.00	50.00	12.00	30.00	25.00	45.00	20.00
	InternVL2.5	8B	34.33	30.00	20.00	58.00	12.00	34.67	32.00	50.00	22.00	38.33	40.00	45.00	30.00
	InternVL2.5	4B	32.22	29.33	28.00	44.00	16.00	34.00	30.00	54.00	18.00	33.33	35.00	40.00	25.00
	LLaVA Next	32B	26.11	24.67	18.0	40.0	16.0	25.33	18.0	40.0	18.0	28.33	25.0	40.0	20.0
	LLaVA Video	7B	24.00	25.33	24.00	36.00	16.00	20.00	16.00	26.00	18.00	26.67	15.00	45.00	20.00
	LLaVA OneVision	7B	23.67	23.33	20.00	34.00	16.00	22.67	20.00	32.00	16.00	25.00	20.00	35.00	20.00
	Qwen2.5 VL	32B	33.34	32.67	12.00	48.00	38.00	30.67	22.00	50.00	20.00	36.67	20.00	60.00	30.00
	Qwen2.5 VL	7B	28.00	24.67	16.00	24.00	34.00	26.00	24.00	26.00	28.00	33.33	35.00	20.00	45.00

Easy	GPT 5	-	52.00	47.33	42.00	42.00	58.00	48.67	46.00	46.00	54.00	60.00	65.00	35.00	80.00
	GPT 4o	-	40.67	35.33	30.00	28.00	48.00	36.67	24.00	38.00	48.00	50.00	45.00	50.00	55.00
	Gemini 2.5 Pro 🥇	-	52.56	56.00	60.0	48.0	60.0	40.00	40.0	36.0	44.0	61.67	75.0	35.0	75.0
	Gemini 2.5 flash think	-	50.67	49.33	40.00	46.00	62.00	46.00	46.00	44.00	48.00	56.67	55.00	40.00	75.00
	Gemini 2.5 flash no-think	-	50.78	49.33	36.00	52.00	60.00	48.00	40.00	50.00	54.00	55.00	60.00	50.00	55.00
	Gemini 1.5 Pro	-	43.00	45.33	36.00	44.00	56.00	42.00	48.00	32.00	46.00	41.67	35.00	50.00	40.00
	Claude 3.5	-	42.45	38.00	34.0	38.0	42.0	42.67	30.0	56.0	42.0	46.67	40.0	45.0	55.0
	InternVL2.5	26B	36.11	35.33	36.00	44.00	26.00	34.67	28.00	46.00	30.00	38.33	30.00	40.00	45.00
	InternVL2.5	8B	38.44	36.67	28.00	46.00	36.00	35.33	32.00	42.00	32.00	43.33	60.00	40.00	30.00
	InternVL2.5	4B	40.33	43.33	42.00	50.00	38.00	39.33	30.00	44.00	44.00	38.33	35.00	60.00	20.00
	LLaVA Next	32B	33.22	36.67	36.00	42.0	32.0	31.33	36.0	32.0	26.0	31.67	35.0	30.0	30.0
	LLaVA Video	7B	33.22	33.33	34.00	38.00	28.00	34.67	34.00	38.00	32.00	31.67	35.00	30.00	30.00
	LLaVA OneVision	7B	33.22	33.33	34.00	38.00	28.00	34.67	34.00	38.00	32.00	31.67	35.00	30.00	30.00
	Qwen2.5 VL	32B	52.45	50.00	34.00	56.00	60.00	50.67	40.00	54.00	58.00	56.67	55.00	60.00	55.00
	Qwen2.5 VL	7B	39.89	33.33	28.00	18.00	54.00	38.00	48.00	16.00	50.00	48.33	55.00	30.00	60.00
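
The leaderboard rows are tab-separated, with the difficulty label given only on the first row of each group. As an illustration of how the medal entries can be recovered, the sketch below parses a handful of rows from the airplane table and picks the highest "Over. Avg." per difficulty. The rows are truncated to their first four columns and the medal emoji is dropped; the function name is hypothetical.

```python
import csv
import io

# A few rows copied from the airplane leaderboard (first four columns only).
rows_tsv = (
    "Hard\tGPT 5\t-\t28.11\n"
    "\tGPT 4o\t-\t18.11\n"
    "\tGemini 2.5 Pro\t-\t31.39\n"
    "Medium\tGPT 5\t-\t44.00\n"
    "\tGemini 2.5 flash no-think\t-\t49.67\n"
    "\tQwen2.5 VL\t32B\t33.34\n"
    "Easy\tGPT 5\t-\t52.00\n"
    "\tGemini 2.5 Pro\t-\t52.56\n"
    "\tQwen2.5 VL\t32B\t52.45\n"
)

def top_per_difficulty(tsv_text):
    """Return {difficulty: (model name, overall average)} for the best row."""
    best = {}
    difficulty = None
    for cells in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not cells:
            continue
        if cells[0]:
            # A non-empty first cell starts a new difficulty group.
            difficulty = cells[0]
        model, size, overall = cells[1], cells[2], float(cells[3])
        name = model if size == "-" else f"{model} {size}"
        if difficulty not in best or overall > best[difficulty][1]:
            best[difficulty] = (name, overall)
    return best

winners = top_per_difficulty(rows_tsv)
print(winners)
```

On these rows the result agrees with the medals shown in the table: Gemini 2.5 Pro for Hard and Easy, and Gemini 2.5 flash no-think for Medium.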

Leaderboard of Benchmark Evaluation in the Ship Motion

Evaluation of the **Ship Motion** using **River** and **Ocean** videos, categorized by reasoning types: temporal, spatial, and intent reasoning. The background color transitions from light blue to light purple, indicating a gradual rise in task difficulty.
Difficulty	Models	Size	Over. Avg.	River Scenarios				Ocean Scenarios
Difficulty	Models	Size	Over. Avg.	Avg.	Temporal	Spatial	Intent	Avg.	Temporal	Spatial	Intent
Hard	GPT 5 🥇	-	38.36	48.72	46.15	30.77	69.23	28.00	30.00	32.00	22.00
	GPT 4o	-	22.10	28.20	38.46	26.92	19.23	16.00	18.00	18.00	12.00
	Gemini 2.5 Pro	-	29.64	34.62	23.08	34.62	46.15	24.67	38.0	16.0	20.0
	Gemini 2.5 flash think	-	27.36	32.05	30.77	26.92	38.46	22.67	30.00	22.00	16.00
	Gemini 2.5 flash no-think	-	27.44	28.21	42.31	19.23	23.08	26.67	36.00	20.00	24.00
	Gemini 1.5 Pro	-	26.02	26.92	23.08	30.77	26.92	25.11	34.00	20.93	20.41
	Claude 3.5	-	25.44	28.20	19.23	19.23	46.15	22.67	26.0	22.0	20.0
	InternVL2.5	26B	22.54	23.08	15.38	19.23	34.62	22.00	18.00	28.00	20.00
	InternVL2.5	8B	21.90	21.79	7.69	26.92	30.77	22.00	16.00	28.00	22.00
	InternVL2.5	4B	20.92	20.51	19.23	19.23	23.08	21.33	16.00	26.00	22.00
	LLaVA Next	32B	14.39	11.54	7.69	19.23	7.69	15.33	8.0	30.0	8.0
	LLaVA Video	7B	14.00	16.67	15.38	23.08	11.54	11.33	8.00	20.00	6.00
	LLaVA OneVision	7B	15.67	16.67	11.54	26.92	11.54	14.67	8.00	28.00	8.00
	Qwen2.5 VL	32B	13.39	14.10	7.69	23.08	11.54	12.67	8.0	24.0	6.0
	Qwen2.5 VL	7B	14.67	16.67	7.69	30.77	11.54	12.67	6.00	24.00	8.00

Medium	GPT 5 🥇	-	51.80	60.26	53.85	46.15	80.77	43.33	56.00	48.00	26.00
	GPT 4o	-	38.49	42.31	50.00	53.85	23.08	34.67	36.00	48.00	20.00
	Gemini 2.5 Pro	-	41.77	44.87	30.77	61.54	42.31	38.67	48.0	46.0	22.0
	Gemini 2.5 flash think	-	48.26	53.85	61.54	57.70	42.31	42.67	52.00	42.00	34.00
	Gemini 2.5 flash no-think	-	46.12	50.00	46.15	57.69	46.15	42.00	56.00	44.00	26.00
	Gemini 1.5 Pro	-	46.31	53.84	46.15	65.38	50.00	38.78	34.00	49.02	33.33
	Claude 3.5	-	38.62	35.90	34.62	50.0	23.08	41.33	42.0	54.0	28.0
	InternVL2.5	26B	41.77	44.87	30.77	57.69	46.15	38.67	24.00	62.00	30.00
	InternVL2.5	8B	41.08	46.15	34.62	61.54	42.31	36.00	34.00	60.00	14.00
	InternVL2.5	4B	44.36	48.72	23.08	65.38	57.69	40.00	28.00	60.00	32.00
	LLaVA Next	32B	20.88	23.08	11.54	38.46	19.23	18.67	10.00	30.00	16.00
	LLaVA Video	7B	21.92	20.51	19.23	26.92	15.38	23.33	20.00	30.00	20.00
	LLaVA OneVision	7B	22.54	23.08	19.23	30.77	19.23	22.00	14.00	34.00	18.00
	Qwen2.5 VL	32B	33.31	34.62	19.23	50.00	34.62	32.00	20.00	50.00	26.00
	Qwen2.5 VL	7B	24.08	29.49	19.23	30.77	38.46	18.67	18.00	26.00	12.00

Easy	GPT 5 🥇	-	63.00	66.67	61.54	50.00	88.46	59.33	78.00	48.00	52.00
	GPT 4o	-	50.51	57.69	57.69	50.00	65.38	43.33	66.00	34.00	30.00
	Gemini 2.5 Pro	-	61.05	64.10	57.69	57.69	76.92	58.00	72.0	50.0	52.0
	Gemini 2.5 flash think	-	62.03	65.39	80.77	42.31	73.08	58.67	70.00	52.00	54.00
	Gemini 2.5 flash no-think	-	58.18	57.69	57.69	38.46	76.92	58.67	80.00	42.00	54.00
	Gemini 1.5 Pro	-	50.69	52.56	42.31	61.54	53.85	48.81	50.00	46.43	50.00
	Claude 3.5	-	49.39	47.44	50.0	53.85	38.46	51.33	62.0	52.0	40.0
	InternVL2.5	26B	55.05	64.10	65.38	57.69	69.23	46.00	50.00	50.00	38.00
	InternVL2.5	8B	53.47	60.26	69.23	46.15	65.38	46.67	46.00	54.00	40.00
	InternVL2.5	4B	53.87	56.41	53.85	57.69	57.69	51.33	52.00	56.00	46.00
	LLaVA Next	32B	35.59	37.18	26.92	53.85	30.77	34.00	30.00	38.00	34.00
	LLaVA Video	7B	31.03	32.05	30.77	34.62	30.77	30.00	22.00	38.00	30.00
	LLaVA OneVision	7B	33.00	33.33	34.62	34.62	30.77	32.67	28.00	38.00	32.00
	Qwen2.5 VL	32B	52.77	61.54	53.85	61.54	69.23	44.00	40.00	54.00	38.00
	Qwen2.5 VL	7B	31.31	34.62	38.46	19.23	46.15	28.00	36.00	22.00	26.00
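
To get a rough sense of how much the difficulty level matters, one can subtract each model's Hard "Over. Avg." from its Easy "Over. Avg.". The sketch below does this for three rows of the ship-motion table. It is an illustrative calculation only: the variable names are ours, and the gap is not a metric reported by the authors.

```python
# (Easy overall average, Hard overall average) copied from the ship-motion leaderboard.
ship_overall = {
    "GPT 5": (63.00, 38.36),
    "Gemini 2.5 Pro": (61.05, 29.64),
    "Qwen2.5 VL 32B": (52.77, 13.39),
}

# Accuracy drop, in percentage points, when moving from Easy to Hard.
easy_to_hard_drop = {
    model: round(easy - hard, 2) for model, (easy, hard) in ship_overall.items()
}

for model, drop in easy_to_hard_drop.items():
    print(f"{model}: -{drop:.2f} points")
```

For these rows the drop ranges from roughly 25 to 39 percentage points.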

Opportunities!

1. Long-Horizon Temporal Reasoning: Future work can explore models with stronger memory and temporal abstraction capabilities to handle long-duration videos with multiple events and delayed causal effects, especially in air and water scenarios.

2. Generalization Across Domains and Modalities: Cross-domain generalization (from land to air or water) and transfer learning across modalities (e.g., combining video, text, and audio) remain underexplored and are crucial for building versatile systems.

3. Safety-Aware and Verifiable Reasoning: Given the high-stakes nature of open-space applications (e.g., autonomous driving or aircraft control), future benchmarks and methods should integrate safety constraints and provide interpretable or verifiable reasoning processes.

BibTeX

@article{gu2025accidentbench,
  title={AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond},
  author={Gu, Shangding and Wang, Xiaohan and Ying, Donghao and Zhao, Haoyu and Yang, Runing and Jin, Ming and Li, Boyi and Pavone, Marco and Yeung-Levy, Serena and Wang, Jun and others},
  journal={arXiv preprint arXiv:2509.26636},
  year={2025}
}
