BabyVision: Visual Reasoning Beyond Language
2026-01-12
Contributors: Liang Chen1,6*, Weichu Xie1,6*, Yiyan Liang1,6*, Hongfeng He1*, Hans Zhao1*, Zhibo Yang3, Zhiqi Huang4, Haoning Wu4, Haoyu Lu4, Y.charles4, Yiping Bao4, Yuantao Fan5, Guopeng Li5, Haiyang Shen1,6, Xuanzhong Chen1,7, Wendong Xu1, Shuzheng Si7, Zefan Cai8, Wenhao Chai9, Ziqi Huang10, Fangfu Liu7, Tianyu Liu6, Baobao Chang6, Xiaobo Hu2, Kaiyuan Chen2, Yixin Ren2, Yang Liu2, Yuan Gong2, Kuan Li1
Affiliations: 1UniPat AI, 2xbench, 3Alibaba Group, 4MoonShot AI, 5Stepfun, 6Peking University, 7Tsinghua University, 8University of Wisconsin–Madison, 9Princeton University, 10Nanyang Technological University
Correspondence: Liang Chen <liangchen@unipat.ai>, Kuan Li <kuanli@unipat.ai>
Can MLLMs See Like a 3-Year-Old?
Introduction
Designing benchmarks is essential for measuring AI progress and guiding future research. Yet benchmarking LLMs is becoming increasingly difficult: today’s models achieve exceptional scores on elite tasks such as Humanity’s Last Exam (HLE) and the International Mathematical Olympiad (IMO), reaching—even surpassing—PhD-level performance in language and textual reasoning.
But do multimodal LLMs (MLLMs) show similar expertise in vision? Meaningful evaluation requires disentangling visual ability from linguistic ability. When a visual task can be fully verbalized, it effectively becomes a language problem, allowing models to lean on strong textual reasoning rather than genuine visual understanding. Robust vision benchmarks must therefore be designed to minimize such linguistic shortcuts.
To rigorously assess models’ pure visual reasoning ability and compare it with human performance, we curate 20 vision-centric tasks—collectively referred to as BabyVision-Mini. We then evaluate models alongside children aged 3–12. The results are striking: even the strongest models perform at approximately the level of a three-year-old child. Gemini3-Pro-Preview still lags typical six-year-old performance by roughly 20 points, and other MLLMs fall below the average abilities of a three-year-old.
Figure 1: Pilot study results on the BabyVision-Mini dataset. We report the average Pass@1 accuracy over three random runs for MLLMs and the average accuracy of human testers in each age group.
This striking gap compels us to ask: why do models that master PhD-level language struggle with 3-year-old-level vision? Specifically, which fundamental components of visual understanding are absent in current MLLMs? In response to these questions, we introduce BabyVision, a multi-modal benchmark at the starting point of human visual reasoning.
In the construction of the full benchmark, we split vision-centric reasoning into four core categories—Fine-grained Discrimination, Visual Tracking, Spatial Perception, and Visual Pattern Recognition—which together comprise 22 basic subtypes, each targeting a specific fundamental visual capability. We then employ a careful and rigorous data curation pipeline—including data collection, filtering, annotation, and cross-checking—to construct 388 questions spanning a wide diversity of visual reasoning tasks.
- Fine-grained Discrimination — Detecting subtle visual differences (8 subtypes)
- Visual Tracking — Following paths, lines, and trajectories (5 subtypes)
- Spatial Perception — Understanding 3D structures and relationships (5 subtypes)
- Visual Pattern Recognition — Identifying logical and geometric patterns (4 subtypes)
Our philosophy is not to stump the model, but to measure the "atomic capabilities" of a model's visual reasoning—those fundamental tasks that are intuitive to humans but serve as the building blocks of visual intelligence.
Quantitative Results
We evaluate leading open-source and proprietary MLLMs on BabyVision and compare them with human baselines. We use two versions: BabyVision-Mini (20 questions) and the full BabyVision (388 questions across 22 atomic types).
Children aged 3–12 (more than 20 per age group, all from one school, with consent obtained) take the Mini, while 16 adults complete the full benchmark.
Together, the evaluations ask two questions: how old does the MLLM "look," and which visual primitives are missing?
BabyVision-Mini: Comparing Young Humans and Models
As our pilot study, BabyVision-Mini is built for meaningful developmental comparison. Its tasks are strictly vision-centric, minimizing language and prior-knowledge demands so that scores reflect visual reasoning rather than text-based inference. Its small size also makes it practical to complete within a single class period for young children.
Under this lens, the gap is striking, as shown in Figure 1. Most frontier MLLMs perform well below the average 3-year-old, despite their PhD-level results on language benchmarks. Gemini3-Pro-Preview is the notable outlier—the only model consistently above the Age-3 band—yet it still lags typical 6-year-olds by ~20 points.
This highlights a core limitation: the issue is not solving “hard problems,” but struggling with pre-language visual primitives—the early perceptual and spatial abilities humans acquire before language becomes the main reasoning tool.
BabyVision-Full: A Full Capability Profile of MLLMs
In the full, fine-grained evaluation, the best model performance is still far from the human level (94.1%). Across closed-source systems, Gemini3-Pro-Preview leads overall (49.7%), followed by GPT-5.2 (34.4%) and Doubao-Seed-1.8 (30.2%), with other models substantially lower (e.g., Qwen3-VL-Plus 19.2%, Grok-4 16.2%, Claude-4.5-Opus 14.2%).
These gaps relative to humans are consistent across categories: performance drops appear in all four families, not just one. This suggests current models lack foundational visual competencies overall—a systemic limitation, not an isolated weakness.
Performance (Pass@1) of closed-source MLLMs on BabyVision. The best result for each question type is marked in bold. Reported values are the average Pass@1 accuracy across three random runs, with standard deviation. All models are run in thinking mode with the highest reasoning budget.
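For reference, the minimal sketch below shows how such numbers can be aggregated: per-run accuracy is computed first, then averaged across the three runs with a standard deviation. It is an illustration of the metric under an assumed data layout, not our evaluation code.

```python
import statistics

def mean_pass_at_1(per_run_correct: list[list[bool]]) -> tuple[float, float]:
    """Aggregate Pass@1 across independent runs.

    per_run_correct[r][q] is True if the model answered question q correctly
    on run r. Returns (mean accuracy %, std dev %) over runs.
    Illustrative helper only, not the BabyVision evaluation code.
    """
    run_accs = [100.0 * sum(run) / len(run) for run in per_run_correct]
    return statistics.mean(run_accs), statistics.stdev(run_accs)

# Example: three runs over four questions
runs = [
    [True, False, True, False],
    [True, False, False, False],
    [True, True, True, False],
]
mean_acc, std_acc = mean_pass_at_1(runs)
print(f"Pass@1 = {mean_acc:.1f} ± {std_acc:.1f}")  # 50.0 ± 25.0
```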
Gemini3-Pro vs. the Rest
Gemini3-Pro-Preview not only leads the overall leaderboard but is also strong across all four families, suggesting a more genuinely visual backbone than competing models. GPT-5.2 ranks second overall and leads in Visual Pattern Recognition, but trails Gemini on more perception-heavy tasks—especially fine-grained discrimination and visual tracking.
For open-source models, the best performer (Qwen3VL-235B-Thinking) reaches 22.2% overall. Two trends emerge. First, test-time “thinking” yields measurable gains: within Qwen3VL, the Thinking variant generally outperforms the Instruct variant (e.g., 22.2% vs. 19.5% at 235B), indicating that explicit intermediate reasoning can partly offset visual uncertainty once the signal is extracted. Second, scaling helps but saturates quickly: even the largest open model remains far below the best closed-source system, implying that more parameters or longer chains alone are insufficient—what’s missing likely relates to data and training paradigms that foster visual rather than text-dominant reasoning.
In short, today’s MLLMs pair strong language reasoning with immature visual foundations. BabyVision quantifies this mismatch and offers fine-grained diagnostics to guide progress toward truly grounded visual reasoning.
The "Unspeakable" Challenge in Visual Reasoning
Why do MLLMs fail at these seemingly simple tasks? The key insight is that these problems are "unspeakable"—they cannot be fully described in language without information loss. When models try to reason through text, they lose critical visual details.
The core problem: MLLMs try to compress visual reasoning into language tokens, but these tasks require direct perceptual processing that cannot be faithfully represented in text. We summarize four recurring vision-centric challenges for current MLLMs observed during our evaluation.
Challenge 1: Observing Non-Verbal Details
A pervasive weakness we observe across BabyVision is the loss of fine, non-verbal detail. When a solution depends on subtle visual cues—such as a tiny offset, a specific boundary curve, or a single-pixel difference—MLLMs often treat distinct choices as interchangeable. The issue is not logical difficulty, but a lack of high-fidelity perception.
Humans typically solve such tasks almost instantly through direct shape matching: mentally translating and rotating each candidate to check boundary alignment. This is a largely perceptual operation—continuous, parallel, and geometry-driven—without needing to name or describe anything.
MLLMs, by contrast, rely on implicit verbalization: (1) Verbalize the shape ("a hook at the top, two legs at the bottom"), (2) Reduce to coarse features (approximate counts, gross topology), (3) Compare candidates in language space. This compression becomes an information bottleneck—once fine structure is flattened into words, micro-differences become indistinguishable.
Core weakness: MLLMs struggle to preserve and manipulate fine spatial structure end-to-end. Even young children can reliably judge "fit" versus "mismatch" through direct visual comparison—this is a perception problem, not a reasoning problem.
Challenge 2: Manifold Understanding
Another failure mode we observe is loss of manifold identity: MLLMs struggle to maintain the continuous identity of even a thin curve. When the answer is encoded in connectivity—not in object semantics—models often degrade from "following a line" to "guessing an endpoint."
Humans solve such tasks by visual tracking: they "lock onto" one curve and continuously follow it through crossings until it terminates. This is an early-acquired visual routine—the perceptual system performs contour integration and maintains "which line I am on" through intersections almost automatically, without naming intermediate steps.
For MLLMs, the core difficulty is that the answer is encoded in the connectivity of a 1D manifold embedded in 2D—a continuous curve that winds, overlaps, and self-intersects. The model tries to translate the curve into discrete instructions (left/right/up/down), but crossings create combinatorial branching. Without a faithful, persistent representation of the curve, the model easily "switches tracks" after a crossing—an error visually obvious to humans but difficult to detect once compressed into words.
Core weakness: MLLMs do not reliably maintain perceptual identity across extended spatial trajectories. Success depends on robust contour integration, continuity-preserving tracking, and resistance to interference from nearby curves—capabilities humans acquire effortlessly in early childhood.
Challenge 3: Spatial Imagination
A third pervasive bottleneck we observe is spatial imagination: the ability to construct a stable internal 3D representation from a 2D depiction, then mentally transform it (change viewpoint, project to silhouette, infer hidden volume) while preserving structural consistency. This skill is fundamental to human vision—children develop it early through play with blocks, drawings, and everyday navigation.
Humans solve such tasks by a brief act of imagination: they mentally view the object from the indicated direction and simply count or compare. Importantly, this is not a verbal process—people do not enumerate every element in language; they just hold the image in their mind and reason directly.
MLLMs, by contrast, translate the visual scene into a language summary before reasoning: (1) Approximate the viewpoint ("arrow points from lower right"), (2) Describe the object in words ("Analyze the Structure's Dimensions (Grid and Heights)"), (3) Guess the 2D features from coarse descriptions. The breakdown is that narration is not a faithful spatial state—once the precise image is compressed into a vague text summary, the model makes predictable errors: missing hidden blocks, miscounting layers, or applying wrong 3D projections.
Core weakness: MLLMs do not reliably "imagine" the 3D object. Spatial imagination—the ability to preserve structure while transforming it—is an ability humans acquire early through perception and interaction, but current MLLMs still rely on language logic, which is a poor substitute for simply holding a shape in mind.
Challenge 4: Visual Pattern Induction
A fourth challenge we repeatedly observe is visual pattern induction: the ability to abstract a generalized transformation rule from a few visual examples and apply it to a new input. Humans typically handle such problems by comparing visual example pairs directly, constructing a small causal graph: which shape contains which, which element is the frame, and how these roles are reassigned from input to output.
The key human ability is to see relational rules (what changed) rather than object attributes (what is there). Whether the pattern involves rotation, swap, reflection, or containment—humans extract the abstract transformation and apply it to novel inputs. The specific shapes, colors, or positions do not matter; only their roles in the transformation do.
MLLMs, by contrast, approach such problems through attribute counting rather than relational mapping. Instead of seeing an abstract operation, they rely on semantic description: describe the source, describe the target, attempt to bridge via text. This approach fails because the model often hallucinates rules based on surface features (e.g., "...two green, two brown, and four orange segments") rather than structural logic. The model focuses on objects as fixed entities rather than elements in a transformation sequence.
Core weakness: MLLMs often mix up appearance with structure. Pattern induction requires ignoring specific visual elements to see the abstract pattern. Success in these tasks requires abstract reasoning over visual relations—a step beyond simple recognition that remains a significant hurdle for current architectures.
Insights from Training: How to Achieve Better Results on BabyVision
We have identified large performance gaps on BabyVision, not only between humans and frontier models but also between closed- and open-source models. This raises a further question: how can we develop stronger visual reasoning skills, and achieve better BabyVision scores, with open models?
As Reinforcement Learning with Verifiable Rewards (RLVR) has recently delivered strong gains in language-reasoning performance for LLMs, we conduct a preliminary study to investigate whether RLVR can similarly improve the visual abilities measured by BabyVision. We use Qwen3-VL-8B-Thinking as the base model and apply RLVR fine-tuning. For data collection, we adopt a BabyVision-style pipeline but draw from larger image sources and remove duplicates, yielding 1,400 training examples. The collected data covers all four major BabyVision task families, yet its difficulty distribution is not fully aligned with BabyVision: the base model achieves 34.2% initial accuracy on the RLVR training set but only 13.1% on BabyVision.
We fine-tune Qwen3-VL-8B-Thinking for 450 steps using the GRPO algorithm. We observe that RLVR is effective on the collected training dataset: both training accuracy and held-out test accuracy consistently improve over the course of training.
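As a rough illustration of this training setup, the sketch below shows the core of GRPO-style RLVR with a binary verifiable reward: several responses are sampled for the same question, each is checked against the ground truth, and advantages are computed relative to the group mean. The answer-extraction heuristic and normalization details are simplifying assumptions, not our training code.

```python
import statistics

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the model's final answer matches the
    ground truth, else 0.0. Here the last token is naively taken as the
    answer; a real checker parses the response more carefully."""
    tokens = [t.strip(".,:;!?()") for t in response.split()]
    final = tokens[-1] if tokens else ""
    return 1.0 if final.lower() == gold_answer.lower() else 0.0

def grpo_advantages(responses: list[str], gold_answer: str) -> list[float]:
    """Group-relative advantages in the style of GRPO: score a group of
    responses sampled from the same prompt, then subtract the group-mean
    reward and normalize by the group standard deviation."""
    rewards = [verifiable_reward(r, gold_answer) for r in responses]
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + 1e-6) for r in rewards]

# Example: four sampled responses to one multiple-choice question
group = ["The answer is B.", "I think it is C.", "Answer: B", "It must be D."]
print(grpo_advantages(group, gold_answer="B"))  # roughly [1, -1, 1, -1]
```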
The BabyVision performance of Qwen3-VL-8B-Thinking before and after RL fine-tuning is reported in the following table. The model achieves a +4.8-point overall accuracy improvement after RLVR training. We also observe consistent gains across most task subtypes, with the sole exception of visual tracking, for which RL fine-tuning yields little or even negative improvement. We hypothesize that this is because visual tracking is the least amenable to verbalization; since RLVR primarily improves performance by encouraging longer and more structured "thinking-token" reasoning, it provides less benefit on tasks that depend on continuous perceptual tracking rather than language-mediated reasoning.
| Category | Before RL | After RL | Improvement |
|---|---|---|---|
| Overall | 13.1 | 17.9 | +4.8 |
| Fine-grained Discrimination | 12.7 | 19.4 | +6.8 |
| Visual Tracking | 10.8 | 9.6 | -1.2 |
| Spatial Perception | 15.0 | 20.9 | +5.9 |
| Visual Pattern Recognition | 15.0 | 20.9 | +5.9 |
Beyond VLMs: Can Generation Help Reasoning?
If text-based reasoning proves insufficient, a natural question arises: can visual generation bridge this gap? Rather than describing solutions in words, could models draw the answer—mirroring how children intuitively point to, trace, or mark solutions when reasoning visually?
Motivated by this insight, we introduce BabyVision-Gen, a generative extension of BabyVision that evaluates whether image- and video-generation models can perform visual reasoning through visual outputs. BabyVision-Gen comprises 280 questions re-annotated from the original benchmark to support generation-based evaluation, where correctness can be directly and unambiguously verified by comparing the model-generated outputs with human-drawn ground-truth solutions. We also develop an automatic evaluation tool for generation models that achieves a 0.96 agreement with human evaluators.
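As one plausible form of such a check (assuming the solution is drawn as a red mark on the input image), the sketch below compares binary masks of the red pixels in the generated image and in the human-drawn ground truth via IoU. The color thresholds and the 0.5 cutoff are placeholder assumptions for illustration; they are not the internals of the BabyVision-Gen evaluator.

```python
import numpy as np
from PIL import Image

def red_mark_mask(img: Image.Image) -> np.ndarray:
    """Binary mask of strongly red pixels (the drawn solution mark).
    The RGB thresholds are arbitrary illustrative values."""
    rgb = np.asarray(img.convert("RGB"), dtype=np.int16)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (r > 150) & (g < 100) & (b < 100)

def mark_iou(generated: Image.Image, ground_truth: Image.Image) -> float:
    """IoU between the drawn mark in a generated image and the human-drawn
    ground-truth mark, after resizing to a common resolution."""
    ground_truth = ground_truth.resize(generated.size)
    gen_mask, gt_mask = red_mark_mask(generated), red_mark_mask(ground_truth)
    union = np.logical_or(gen_mask, gt_mask).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(gen_mask, gt_mask).sum()) / float(union)

def is_correct(generated: Image.Image, ground_truth: Image.Image,
               iou_threshold: float = 0.5) -> bool:
    """Judge the generation correct if its mark overlaps the ground truth
    strongly enough; 0.5 is a placeholder threshold."""
    return mark_iou(generated, ground_truth) >= iou_threshold
```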
Generation Models Show a Promising Direction
Naturally, generative models introduce a new test-time scaling dimension for visual reasoning, enabling explicit image manipulation, intermediate markups, and hint drawing during the reasoning process. We evaluate several state-of-the-art image and video generation models on BabyVision-Gen, including Nano-Banana-Pro, GPT-1.5-Image, Qwen-Image-Edit-2511, Veo-3, and Sora-2.
| Category | Nano-Banana-Pro | GPT-1.5-Image | Qwen-Image-Edit-2511 |
|---|---|---|---|
| Overall | 18.3 | 9.8 | 4.8 |
| Fine-grained Discrimination | 24.5 | 9.6 | 4.7 |
| Visual Tracking | 6.7 | 2.4 | 0.0 |
| Spatial Perception | 13.0 | 12.4 | 7.3 |
| Visual Pattern Recognition | 22.8 | 16.7 | 7.9 |
Our results suggest that video generation could serve as a new paradigm for multimodal reasoning on tasks that remain challenging for vision–language models (VLMs), as illustrated below. However, despite these encouraging behaviors, current generative models still struggle to consistently arrive at fully correct solutions, as the performance table above shows. Nevertheless, echoing "Video models are zero-shot learners and reasoners," these findings point to a compelling direction: video generation models hold strong potential to evolve into well-rounded multimodal reasoners, especially when visual reasoning is grounded in explicit visual manipulation rather than language alone.
Task: Draw a red line to trace the complete line extending from the top left figure.
Sora-2:
Nano-Banana-Pro:
From the generated outputs on this visual tracking task, we observe that these two models exhibit the most human-like visual thinking processes, explicitly drawing trajectories along the paths in the image. However, despite this alignment with human behavior, their generations still contain noticeable errors, indicating that substantial room for improvement remains.
Cases for Generative Visual Reasoning
Below we compare multiple generation models on BabyVision-Gen tasks:
More Examples
Conclusion
BabyVision reveals a striking truth: current MLLMs lack robust foundational visual competence, even when compared with young children. Despite their impressive performance on language-heavy and expert-level benchmarks, today's MLLMs still struggle with the pre-linguistic visual primitives that humans acquire in early childhood.
By decomposing visual intelligence into atomic capabilities and benchmarking them independently of language, BabyVision exposes where current models fall short and why scaling language alone is insufficient. Our results further suggest that visual generation—reasoning by drawing, tracing, and manipulating images—offers a promising path forward, partially recovering capabilities that text-based reasoning cannot express.
These atomic visual abilities are also critical for embodied AI: it is difficult to imagine a robot with visual competence below that of a three-year-old reliably assisting humans in the physical world. BabyVision provides a diagnostic lens and a research direction: to advance multimodal intelligence, future models must rebuild vision from the ground up rather than relying on linguistic shortcuts.
Acknowledgements
We would like to thank Xiaotao Gu (Zhipu AI), Junyang Lin (Alibaba Group), Shuai Bai (Alibaba Group), and Shuhuai Ren (Xiaomi MiMo) for their valuable discussions and insightful feedback throughout this project.
Citation
For details of BabyVision, please read our paper. If you find it useful in your research, please kindly cite:
@article{babyvision2026,
title={BabyVision: Visual Reasoning Beyond Language},
year={2026}
}