We introduce the DISE taxonomy, which organizes spatial reasoning tasks into four quadrants along two dimensions: Intrinsic vs. Extrinsic and Static vs. Dynamic. The intrinsic-extrinsic dimension distinguishes tasks that require understanding an object's internal properties (intrinsic) from those that involve the object's relationship with its environment (extrinsic). The static-dynamic dimension separates tasks involving unchanging spatial relationships (static) from those requiring mental manipulation or transformation of objects (dynamic). This taxonomy provides a unified framework for evaluating and developing the spatial reasoning capabilities of vision-language models.
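As a minimal sketch of how the taxonomy can be operationalized (the `Task` dataclass, enum names, and example task placements below are our own illustrative assumptions, not part of the benchmark's released code), a task's quadrant is simply the pair of its two dimension values; the sample items that follow illustrate tasks drawn from these quadrants:

```python
from dataclasses import dataclass
from enum import Enum

class Frame(Enum):
    INTRINSIC = "intrinsic"   # understanding an object's internal properties
    EXTRINSIC = "extrinsic"   # the object's relationship with its environment

class Change(Enum):
    STATIC = "static"     # unchanging spatial relationships
    DYNAMIC = "dynamic"   # mental manipulation or transformation

@dataclass(frozen=True)
class Task:
    name: str
    frame: Frame
    change: Change

    @property
    def quadrant(self) -> str:
        # A quadrant is the pair of dimension values, e.g. "intrinsic-dynamic".
        return f"{self.frame.value}-{self.change.value}"

# Hypothetical placements, for intuition only:
for task in (Task("cube_net_folding", Frame.INTRINSIC, Change.DYNAMIC),
             Task("top_view_projection", Frame.EXTRINSIC, Change.STATIC)):
    print(f"{task.name}: {task.quadrant}")
```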
Question: The two views above show the same 3D structure from different perspectives. Which option below could be combined to form the same structure without rotation or overlap?
Options: A, B, C, D
Answer: D
Question: Look at the 3D shape above. After rotating the shape in your mind, choose the option that matches it exactly.
Options: A, B, C, D
Answer: D
Question: The image shows the six faces of a cube in an unfolded net. After folding, which option corresponds to the resulting cube?
Options: A, B, C, D
Answer: C
Question: Which option is the most likely 2D view of the structure from the top direction?
Options: A, B, C, D
Answer: B
Question: Based on two views of the same cube, what is the most likely image on the highlighted face?
Options: A, B, C, D
Answer: A
Question: Which figure can be made from these shapes without resizing?
Options: A, B, C, D
Answer: A
Question: Which figure is a rotation of the object?
Options: A, B, C, D
Answer: C
Question: The shape on the left is hidden in one of the figures on the right without any rotation or flipping. Which figure contains it?
Options: A, B, C, D
Answer: B
Question: Work out which option shows the figure when folded along the dotted line.
Options: A, B, C, D
Answer: A
Question: A square is folded and holes are punched. Which option shows the unfolded square correctly?
Options: A, B, C, D
Answer: C
| Benchmark | SpaceOm Base | SpaceOm +DISE SFT | ∆ (pp) | Qwen2.5-VL-7B Base | Qwen2.5-VL-7B +DISE SFT | ∆ (pp) |
|---|---|---|---|---|---|---|
| Spatial-DISE | 25.9% | 41.3% | ↑ 15.4 | 26.1% | 47.0% | ↑ 20.9 |
| CVBench | 68.8% | 70.33% | ↑ 1.53 | 75.9% | 77.4% | ↑ 1.5 |
| SAT | 46.67% | 49.33% | ↑ 2.66 | 65.3% | 69.3% | ↑ 4.0 |
| SPACE | 27.22% | 32.6% | ↑ 5.38 | 28.7% | 32.2% | ↑ 3.5 |
| OmniSpatial | 27.91% | 34.28% | ↑ 6.37 | 21.8% | 34.0% | ↑ 12.2 |
| VSIBench_MCQ | 31.05% | 33.7% | ↑ 2.65 | 19.3% | 22.6% | ↑ 3.3 |
Evaluation results of SpaceOm and Qwen2.5-VL-7B before and after supervised fine-tuning (SFT) on Spatial-DISE-12K.
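For clarity, the ∆ column reports absolute improvement in percentage points rather than relative gain. A minimal sketch of that computation, using the Qwen2.5-VL-7B numbers from the table above (the variable names are our own):

```python
# Accuracies in % before and after SFT on Spatial-DISE-12K (Qwen2.5-VL-7B column).
base = {"Spatial-DISE": 26.1, "CVBench": 75.9, "SAT": 65.3,
        "SPACE": 28.7, "OmniSpatial": 21.8, "VSIBench_MCQ": 19.3}
sft = {"Spatial-DISE": 47.0, "CVBench": 77.4, "SAT": 69.3,
       "SPACE": 32.2, "OmniSpatial": 34.0, "VSIBench_MCQ": 22.6}

for bench, acc in base.items():
    delta_pp = sft[bench] - acc  # percentage points, not a relative percentage
    print(f"{bench}: {acc:.1f}% -> {sft[bench]:.1f}% (+{delta_pp:.1f} pp)")
```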
Our comprehensive evaluation reveals that spatial reasoning remains a significant challenge for current Vision-Language Models (VLMs). Below are the key insights from our analysis:
- **Spatial reasoning is a universal bottleneck.** The average accuracy across the 33 tested models was only 28.4%, marginally above random chance (25%) and far below the human baseline of 76.8%. Even the reasoning-enhanced Doubao1.5-VL-thinking achieved only 42.0%.
- **The primary failure mode is reasoning, not perception.** Our error analysis indicates that 72.5% of failures stem from reasoning errors (e.g., failures in rule application or mental simulation), while perceptual errors account for only 17.5%.
- **Models struggle with tasks requiring sequential mental transformations** (e.g., Fold and Punch). This points to a critical deficit in "spatial working memory" that prevents models from reliably tracking objects through a sequence of changes.
- **Fine-tuning on Spatial-DISE-12K yields large gains.** Notably, training on Extrinsic-Dynamic 3D tasks transfers well to 2D tasks, suggesting that scene-centric dynamic reasoning supports a reusable representation across diverse spatial problems.
We analyzed human performance using both Classical Test Theory (CTT) and Item Response Theory (IRT) to establish a robust baseline.
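We do not reproduce the IRT fitting here; as an illustrative sketch only (the exact model variant used in the paper is not restated in this section, and `p_correct`, `theta`, `a`, `b`, and `c` are our own notation), the standard three-parameter logistic (3PL) item response function, with the guessing floor fixed at 0.25 for four-option items, looks like:

```python
import math

def p_correct(theta: float, a: float, b: float, c: float = 0.25) -> float:
    """Standard 3PL item response function: probability that a respondent
    of ability theta answers correctly an item with discrimination a,
    difficulty b, and guessing floor c (0.25 for a four-option MCQ)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# An item of average difficulty (b=0) and moderate discrimination (a=1):
for theta in (-2.0, 0.0, 2.0):
    print(f"theta={theta:+.1f}: P(correct)={p_correct(theta, a=1.0, b=0.0):.3f}")
```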
We investigated how training on specific 3D tasks influences performance on other quadrants and 2D tasks.
We conducted a detailed manual analysis of model failures. Below are examples of the common error types: Reasoning Errors, Perceptual Errors, and Comprehension Errors.
@inproceedings{huang2025spatialdise,
title = {Spatial-{{DISE}}: {{A Unified Benchmark}} for {{Evaluating Spatial Reasoning}} in {{Vision-Language Models}}},
booktitle = {The {{Fourteenth International Conference}} on {{Learning Representations}}},
author = {Huang, Xinmiao and He, Qisong and Huang, Zhenglin and Wang, Boxuan and Li, Zhuoyun and Cheng, Guangliang and Dong, Yi and Huang, Xiaowei},
year = 2025
}