Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

School of Computer Science and Informatics, University of Liverpool
ICLR 2026
Teaser Image

A Comprehensive Overview of the Spatial-DISE Framework, Generation Pipeline, and Benchmark Statistics. a) compares examples from existing benchmarks, which primarily test general static reasoning, with the cognitively grounded intrinsic-dynamic tasks in our Spatial-DISE benchmark. b) introduces the core DISE taxonomy, showing the four quadrants of spatial reasoning and their distribution across the 559-pair evaluation benchmark. c) presents evaluation results, showing a significant gap between model and human performance. d) details the synthetic data generation pipeline implemented in Blender, and e) provides a statistical breakdown of the task categories within both Spatial-DISE Bench and Spatial-DISE-12K.

Abstract

Spatial reasoning ability is crucial for Vision-Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate for assessing spatial reasoning ability, especially intrinsic-dynamic spatial reasoning, which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, Spatial-DISE, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline that generates diverse and verifiable spatial reasoning questions, resulting in a new Spatial-DISE dataset comprising Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation of 28 state-of-the-art VLMs reveals that current VLMs exhibit a large and consistent gap relative to human competence, especially on multi-step, multi-view spatial reasoning. Spatial-DISE offers a robust framework, a valuable dataset, and a clear direction for future research toward human-like spatial intelligence. The benchmark, dataset, and code will be publicly released.

Spatial-DISE Taxonomy

We introduce the DISE taxonomy, which categorizes spatial reasoning tasks into four quadrants based on two dimensions: Intrinsic vs. Extrinsic and Static vs. Dynamic. The intrinsic-extrinsic dimension distinguishes tasks that require understanding an object's internal properties (intrinsic) from those that involve the object's relationship with its environment (extrinsic). The static-dynamic dimension separates tasks involving unchanging spatial relationships (static) from those requiring mental manipulation or transformation of objects (dynamic). This taxonomy provides a unified framework for evaluating and developing the spatial reasoning capabilities of vision-language models.
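
As a concrete, purely illustrative reading of the taxonomy, the two axes can be treated as independent labels attached to each task. The minimal Python sketch below encodes this idea; the task names and their quadrant assignments are our own hypothetical examples, not the benchmark's actual schema.

from dataclasses import dataclass
from enum import Enum

class Frame(Enum):
    INTRINSIC = "Intrinsic"   # reasoning about an object's internal structure
    EXTRINSIC = "Extrinsic"   # reasoning about an object's relation to its environment

class Change(Enum):
    STATIC = "Static"         # spatial relations stay fixed
    DYNAMIC = "Dynamic"       # a mental transformation is required

@dataclass
class SpatialTask:
    name: str
    frame: Frame
    change: Change

    @property
    def quadrant(self) -> str:
        return f"{self.frame.value}-{self.change.value}"

# Hypothetical examples only; actual Spatial-DISE task names may differ.
for task in [
    SpatialTask("identify a shape's cross-section", Frame.INTRINSIC, Change.STATIC),
    SpatialTask("fold-and-punch prediction", Frame.INTRINSIC, Change.DYNAMIC),
    SpatialTask("which object is left of the chair", Frame.EXTRINSIC, Change.STATIC),
    SpatialTask("where is the ball after it rolls", Frame.EXTRINSIC, Change.DYNAMIC),
]:
    print(f"{task.name:40s} -> {task.quadrant}")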

Spatial-DISE Taxonomy
Spatial-DISE Classification

Spatial-DISE Tasks

Evaluation

Evaluation Table

Evaluation results of 28 SOTA models and 2 models fine-tuned (SFT) on Spatial-DISE-12K. Row colors: Base, ∆ vs. base, Reasoning, Spatial, SFT on Spatial-DISE-12K. A ∆ row shows the absolute change in percentage points (pp) relative to its base model and is placed between the parent and the derived model. Values are accuracy (%); brackets give the [lower, upper] bounds of the 95% CI. Bold indicates the highest accuracy; underline indicates the second highest.
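
For readers reproducing the table, the sketch below shows one common way to attach a 95% confidence interval to a per-model accuracy on the 559-item bench (a Wilson score interval). The paper's exact interval method is not stated here, so treat the choice as an assumption; the 159/559 correct count is purely illustrative.

import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial accuracy estimate."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Illustrative: a model answering 159 of the 559 Spatial-DISE Bench items correctly.
lo, hi = wilson_ci(159, 559)
print(f"accuracy = {159/559:.1%}, 95% CI = [{lo:.1%}, {hi:.1%}]")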

Fine-tuning Results (SFT)

Benchmark      | SpaceOm                       | Qwen2.5-VL-7B
               | Base     +DISE SFT   ∆ (pp)   | Base     +DISE SFT   ∆ (pp)
Spatial-DISE   | 25.9%    41.3%       ↑ 15.4   | 26.1%    47.0%       ↑ 20.9
CVBench        | 68.8%    70.33%      ↑ 1.53   | 75.9%    77.4%       ↑ 1.5
SAT            | 46.67%   49.33%      ↑ 2.66   | 65.3%    69.3%       ↑ 4.0
SPACE          | 27.22%   32.6%       ↑ 5.38   | 28.7%    32.2%       ↑ 3.5
OmniSpatial    | 27.91%   34.28%      ↑ 6.37   | 21.8%    34.0%       ↑ 12.2
VSIBench_MCQ   | 31.05%   33.7%       ↑ 2.65   | 19.3%    22.6%       ↑ 3.3

Evaluation results of SpaceOm and Qwen2.5-VL-7B before and after supervised fine-tuning (SFT) on Spatial-DISE-12K.

Key Findings

Our comprehensive evaluation reveals that spatial reasoning remains a significant challenge for current Vision-Language Models (VLMs). Below are the key insights from our analysis:

Universal Challenge

Spatial reasoning is a universal bottleneck. The average accuracy across 33 tested models was only 28.4%, marginally above random chance (25%) and far below the human baseline of 76.8%. Even the reasoning-enhanced Doubao1.5-VL-thinking achieved only 42.0%.

Reasoning vs. Perception

The primary failure mode is reasoning, not perception. Our error analysis indicates that 72.5% of failures are due to reasoning errors (e.g., failure in rule application or mental simulation), while perceptual errors account for only 17.5%.

Multi-Step Reasoning Deficit

Models struggle with tasks requiring sequential mental transformations (e.g., Fold and Punch). This suggests a critical deficit in "spatial working memory," preventing models from reliably tracking objects through a sequence of changes.
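
To make the sequential nature of such items concrete, here is a minimal, self-contained simulation of a single-fold fold-and-punch item. The grid size, fold direction, and punch location are our own illustrative choices, not an actual benchmark item; answering correctly requires tracking both paper layers through the fold and the unfold.

import numpy as np

def fold_punch_unfold(n: int = 6, punch: tuple[int, int] = (1, 4)) -> np.ndarray:
    """Fold the left half of an n x n sheet onto the right half, punch one hole
    in the folded stack, then unfold and return the resulting hole pattern."""
    holes = np.zeros((n, n), dtype=int)
    r, c = punch                 # punch coordinates on the folded (right-half) sheet
    holes[r, c] = 1              # layer that was already on the right
    holes[r, n - 1 - c] = 1      # layer folded over from the left (mirrored column)
    return holes

print(fold_punch_unfold())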

Transfer Learning Potential

Fine-tuning on Spatial-DISE-12K yields large gains. Notably, training on Extrinsic-Dynamic 3D tasks transfers well to 2D tasks, suggesting that scene-centric dynamic reasoning supports a reusable representation for various spatial problems.

Detailed Analysis

Human Performance Breakdown

We analyzed human performance using both Classical Test Theory (CTT) and Item Response Theory (IRT) to establish a robust baseline.
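
As a rough illustration of the CTT side of this analysis (the IRT fit is omitted for brevity), the snippet below computes per-item difficulty (proportion correct) and an uncorrected item-total discrimination from a binary response matrix. The response values shown are toy data, not the study's human responses.

import numpy as np

def ctt_item_stats(responses: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """CTT statistics for a 0/1 matrix (rows = participants, columns = items)."""
    difficulty = responses.mean(axis=0)          # proportion correct per item
    total_score = responses.sum(axis=1)
    discrimination = np.array([                  # item-total point-biserial correlation
        np.corrcoef(responses[:, j], total_score)[0, 1]
        for j in range(responses.shape[1])
    ])
    return difficulty, discrimination

# Toy responses: 5 participants x 4 items (illustrative only).
resp = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])
difficulty, discrimination = ctt_item_stats(resp)
print("difficulty:    ", difficulty)
print("discrimination:", np.round(discrimination, 2))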

Human Performance by DISE Quadrant
Human accuracy across the four DISE quadrants.
Human Accuracy Distribution
Distribution of human accuracy scores.
IRT vs CTT Comparison
Comparison of Classical Test Theory (CTT) and Item Response Theory (IRT) results.

Transfer Learning Analysis

We investigated how training on specific 3D tasks influences performance on other quadrants and 2D tasks.
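
The heatmaps below are read as accuracy deltas. A minimal sketch of how such a transfer matrix could be assembled follows; the quadrant names come from the DISE taxonomy, while the accuracies used here are dummy placeholders rather than measured results from the paper.

import numpy as np

QUADRANTS = ["Intrinsic-Static", "Intrinsic-Dynamic", "Extrinsic-Static", "Extrinsic-Dynamic"]

def transfer_matrix(acc_sft: dict, acc_base: dict) -> np.ndarray:
    """Entry (i, j): change in accuracy (pp) on quadrant j after fine-tuning
    only on quadrant i, relative to the base model's accuracy on quadrant j."""
    m = np.zeros((len(QUADRANTS), len(QUADRANTS)))
    for i, train_q in enumerate(QUADRANTS):
        for j, eval_q in enumerate(QUADRANTS):
            m[i, j] = 100.0 * (acc_sft[(train_q, eval_q)] - acc_base[eval_q])
    return m

# Dummy placeholder accuracies (NOT measured values) just to exercise the function.
acc_base = {q: 0.25 for q in QUADRANTS}
acc_sft = {(tq, eq): (0.40 if tq == eq else 0.28) for tq in QUADRANTS for eq in QUADRANTS}
print(transfer_matrix(acc_sft, acc_base))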

Cross-Quadrant Transfer
Heatmap showing how fine-tuning on one DISE quadrant affects performance on others. Note the strong diagonal (domain specialization) and specific transfer paths.
3D to 2D Transfer
Impact of 3D training on 2D tasks. Extrinsic-Dynamic training shows broad positive transfer to 2D tasks.

Error Analysis

We conducted a detailed manual analysis of model failures. Below are examples of common error types including Reasoning Errors, Perceptual Errors, and Comprehension Errors.

BibTeX

@inproceedings{huang2025spatialdise,
  title = {Spatial-{{DISE}}: {{A Unified Benchmark}} for {{Evaluating Spatial Reasoning}} in {{Vision-Language Models}}},
  booktitle = {The {{Fourteenth International Conference}} on {{Learning Representations}}},
  author = {Huang, Xinmiao and He, Qisong and Huang, Zhenglin and Wang, Boxuan and Li, Zhuoyun and Cheng, Guangliang and Dong, Yi and Huang, Xiaowei},
  year = 2026
}