
BabyVision: Visual Reasoning Beyond Language

2026-01-12

Contributors: Liang Chen1,6*, Weichu Xie1,6*, Yiyan Liang1,6*, Hongfeng He1*, Hans Zhao1*, Zhibo Yang3, Zhiqi Huang4, Haoning Wu4, Haoyu Lu4, Y.charles4, Yiping Bao4, Yuantao Fan5, Guopeng Li5, Haiyang Shen1,6, Xuanzhong Chen1,7, Wendong Xu1, Shuzheng Si7, Zefan Cai8, Wenhao Chai9, Ziqi Huang10, Fangfu Liu7, Tianyu Liu6, Baobao Chang6, Xiaobo Hu2, Kaiyuan Chen2, Yixin Ren2, Yang Liu2, Yuan Gong2, Kuan Li1

Affiliations: 1UniPat AI, 2xbench, 3Alibaba Group, 4MoonShot AI, 5Stepfun, 6Peking University, 7Tsinghua University, 8University of Wisconsin–Madison, 9Princeton University, 10Nanyang Technological University

Correspondence: Liang Chen <liangchen@unipat.ai>, Kuan Li <kuanli@unipat.ai>

Can MLLMs See Like a 3-Year-Old?

Introduction

Designing benchmarks is essential for measuring AI progress and guiding future research. Yet benchmarking LLMs is becoming increasingly difficult: today’s models achieve exceptional scores on elite tasks such as Humanity’s Last Exam (HLE) and the International Mathematical Olympiad (IMO), reaching—even surpassing—PhD-level performance in language and textual reasoning.

But do multimodal LLMs (MLLMs) show similar expertise in vision? Meaningful evaluation requires disentangling visual ability from linguistic ability. When a visual task can be fully verbalized, it effectively becomes a language problem, allowing models to lean on strong textual reasoning rather than genuine visual understanding. Robust vision benchmarks must therefore be designed to minimize such linguistic shortcuts.

To rigorously assess models' pure visual reasoning ability and compare it with human performance, we curate 20 vision-centric tasks—collectively referred to as BabyVision-Mini. We then evaluate models alongside children aged 3–12. The results are striking: even the strongest models perform at approximately the level of a three-year-old child. Gemini3-Pro-Preview still lags typical six-year-old performance by roughly 20%, and other MLLMs fall below the average abilities of a three-year-old.

BabyVision: Performance comparison between MLLMs and human children

Pilot Study Results on the BabyVision-Mini Dataset. We report the average pass@1 accuracy over three random runs for MLLMs and the average accuracy among human testers of different ages.

This striking gap compels us to ask: why do models that master PhD-level language struggle with 3-year-old-level vision? Specifically, which fundamental components of visual understanding are absent in current MLLMs? In response to these questions, we introduce BabyVision, a multi-modal benchmark at the starting point of human visual reasoning.

In the construction of the full benchmark, we split vision-centric reasoning into four core categories—Fine-grained Discrimination, Visual Tracking, Spatial Perception, and Visual Pattern Recognition—which together comprise 22 basic subtypes, each targeting a specific fundamental visual capability. We then employ a careful and rigorous data curation pipeline—including data collection, filtering, annotation, and cross-checking—to construct 388 questions spanning a wide diversity of visual reasoning tasks.

  • Fine-grained Discrimination — Detecting subtle visual differences (8 subtypes)
  • Visual Tracking — Following paths, lines, and trajectories (5 subtypes)
  • Spatial Perception — Understanding 3D structures and relationships (5 subtypes)
  • Visual Pattern Recognition — Identifying logical and geometric patterns (4 subtypes)
BabyVision-Full Results Table

Our philosophy is not to stump the model, but to measure the "atomic capabilities" of a model's visual reasoning—those fundamental tasks that are intuitive to humans but serve as the building blocks of visual intelligence.

Quantitative Results

We evaluate leading open-source and proprietary MLLMs on BabyVision and compare them against human baselines, using two versions: BabyVision-Mini (20 questions) and the full BabyVision (388 questions across 22 atomic types).

Children aged 3–12 (more than 20 per age group, recruited from a single school with consent obtained) take the Mini version, while 16 adults complete the full benchmark.

Together, the two evaluations address two questions: how old does an MLLM "look" in terms of visual development, and which visual primitives are missing?

BabyVision-Mini: Comparing Young Humans and Models

As our pilot study, BabyVision-Mini is built for meaningful developmental comparison. Its tasks are strictly vision-centric, minimizing language and prior-knowledge demands so that scores reflect visual reasoning rather than text-based inference. Its small size also makes it practical to complete within a single class period for young children.

Under this lens, the gap is striking (see Figure 1). Most frontier MLLMs perform well below the average 3-year-old, despite their PhD-level results on language benchmarks. Gemini3-Pro-Preview is the notable outlier—the only model consistently above the Age-3 band—yet it still lags typical 6-year-olds by ~20 points.

This highlights a core limitation: the issue is not solving “hard problems,” but struggling with pre-language visual primitives—the early perceptual and spatial abilities humans acquire before language becomes the main reasoning tool.

BabyVision-Full: A Full Capability Profile of MLLMs

In the full, fine-grained evaluation, the best model performance is still far from the human level (94.1%). Across closed-source systems, Gemini3-Pro-Preview leads overall (49.7%), followed by GPT-5.2 (34.4%) and Doubao-Seed-1.8 (30.2%), with other models substantially lower (e.g., Qwen3-VL-Plus 19.2%, Grok-4 16.2%, Claude-4.5-Opus 14.2%).

Performance Radar Chart

These gaps relative to humans are consistent across categories: performance drops appear in all four families, not just one. This suggests current models lack foundational visual competencies overall—a systemic limitation, not an isolated weakness.

Performance Table

Performance (Pass@1) of closed-source MLLMs on BabyVision. The best result for each question type is marked in bold. Reported values are the average Pass@1 accuracy across three random runs, accompanied by the standard deviation. All models are run in thinking mode with the highest reasoning budget.
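For readers who want to reproduce the headline numbers, the reported metric is simply pass@1 averaged over independent runs, with the standard deviation across runs. A minimal sketch of that computation in Python (the per-run answer lists are hypothetical; this is not the benchmark's official evaluation code):

```python
import statistics

def pass_at_1(per_question_correct: list) -> float:
    """Pass@1 for one run: percentage of questions answered correctly on the first attempt."""
    return 100.0 * sum(per_question_correct) / len(per_question_correct)

def report(runs: list) -> str:
    """Average pass@1 and standard deviation across independent runs."""
    scores = [pass_at_1(run) for run in runs]
    return f"{statistics.mean(scores):.1f} ± {statistics.stdev(scores):.1f}"

# Three hypothetical runs over a 10-question subset (True = answered correctly).
runs = [
    [True, False, True, False, False, True, False, False, True, False],   # 40%
    [True, False, True, False, True,  True, False, False, True, False],   # 50%
    [False, False, True, False, False, True, False, False, True, False],  # 30%
]
print(report(runs))  # 40.0 ± 10.0
```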

Gemini3-Pro vs. the Rest

Gemini3-Pro-Preview not only leads the overall leaderboard but is also strong across all four families, suggesting a more genuinely visual backbone than competing models. GPT-5.2 ranks second overall and leads in Visual Pattern Recognition, but trails Gemini on more perception-heavy tasks—especially fine-grained discrimination and visual tracking.

For open-source models, the best performer (Qwen3VL-235B-Thinking) reaches 22.2% overall. Two trends emerge. First, test-time “thinking” yields measurable gains: within Qwen3VL, the Thinking variant generally outperforms the Instruct variant (e.g., 22.2% vs. 19.5% at 235B), indicating that explicit intermediate reasoning can partly offset visual uncertainty once the signal is extracted. Second, scaling helps but saturates quickly: even the largest open model remains far below the best closed-source system, implying that more parameters or longer chains alone are insufficient—what’s missing likely relates to data and training paradigms that foster visual rather than text-dominant reasoning.

In short, today’s MLLMs pair strong language reasoning with immature visual foundations. BabyVision quantifies this mismatch and offers fine-grained diagnostics to guide progress toward truly grounded visual reasoning.

The "Unspeakable" Challenge in Visual Reasoning

Why do MLLMs fail at these seemingly simple tasks? The key insight is that these problems are "unspeakable"—they cannot be fully described in language without information loss. When models try to reason through text, they lose critical visual details.

The core problem: MLLMs try to compress visual reasoning into language tokens, but these tasks require direct perceptual processing that cannot be faithfully represented in text. We summarize four classic vision-centric challenges for current MLLMs that we observed during evaluation.

Challenge 1: Observing Non-Verbal Details

Hexagon Puzzle
Fine-grained Discrimination / Pattern Completion
Find the missing piece that fits into the empty white space within the large hexagonal structure.
Answer: B

A pervasive weakness we observe across BabyVision is the loss of fine, non-verbal detail. When a solution depends on subtle visual cues—such as a tiny offset, a specific boundary curve, or a single-pixel difference—MLLMs often treat distinct choices as interchangeable. The issue is not logical difficulty, but a lack of high-fidelity perception.

Humans typically solve such tasks almost instantly through direct shape matching: mentally translating and rotating each candidate to check boundary alignment. This is a largely perceptual operation—continuous, parallel, and geometry-driven—without needing to name or describe anything.

MLLMs, by contrast, rely on implicit verbalization: (1) Verbalize the shape ("a hook at the top, two legs at the bottom"), (2) Reduce to coarse features (approximate counts, gross topology), (3) Compare candidates in language space. This compression becomes an information bottleneck—once fine structure is flattened into words, micro-differences become indistinguishable.

Core weakness: MLLMs struggle to preserve and manipulate fine spatial structure end-to-end. Even young children can reliably judge "fit" versus "mismatch" through direct visual comparison—this is a perception problem, not a reasoning problem.
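To make the contrast concrete, the "direct shape matching" that humans perform can be approximated algorithmically as exact matching of binary masks under translation and rotation. A minimal sketch (the masks, grid, and helper names are illustrative assumptions, not part of the BabyVision tooling):

```python
import numpy as np

def fits(hole: np.ndarray, piece: np.ndarray) -> bool:
    """Return True if some rotation + translation of `piece` exactly fills `hole`.

    Both inputs are boolean masks on the same grid: `hole` marks the empty cells
    of the puzzle, `piece` marks the cells occupied by a candidate answer.
    """
    if hole.sum() != piece.sum():              # quick area check
        return False
    for k in range(4):                         # 0 / 90 / 180 / 270 degree rotations
        rot = np.rot90(piece, k)
        ph, pw = rot.shape
        hh, hw = hole.shape
        if ph > hh or pw > hw:
            continue
        for y in range(hh - ph + 1):           # slide the rotated piece over the grid
            for x in range(hw - pw + 1):
                placed = np.zeros_like(hole)
                placed[y:y + ph, x:x + pw] = rot
                if np.array_equal(placed, hole):
                    return True
    return False

# Toy example: an L-shaped hole surrounded by non-hole cells, and two candidates.
hole = np.pad(np.array([[1, 0],
                        [1, 1]], dtype=bool), 1)      # border cells are not part of the hole
piece_a = np.array([[1, 1, 1]], dtype=bool)           # straight bar: wrong shape
piece_b = np.array([[1, 1],
                    [0, 1]], dtype=bool)              # the hole rotated by 180 degrees: fits
print(fits(hole, piece_a), fits(hole, piece_b))       # False True
```

The point is not that MLLMs should call such a routine, but that the comparison it performs is geometric and lossless, whereas a verbal summary of the same shapes is not.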

Gemini3-Pro-Preview
Wrong observations are highlighted in red
The goal is to find the puzzle piece that fits into the empty white space within the large hexagonal structure. Let's break down the geometric constraints of the "hole" to determine the correct piece. 1. The Bottom "Island": Looking at the bottom center of the empty space, there is a single orange hexagon belonging to the main structure that points upward. This creates an obstruction in the middle of the base. * Constraint: The missing piece must have an indentation or arch at the bottom to straddle this "island" block. It needs two "legs" separated by a gap. * Observation: All four options (A, B, C, and D) feature this split-leg design, so we must look at other features. 2. The Right Flank (The "Neck"): Examine the right side boundary of the hole. Start from the very top peak and move down the right edge. Shortly below the top, the orange wall juts inward (to the left). * Constraint: The missing piece cannot have a solid block immediately to the bottom-right of its top "head" hexagon. It needs a "neck" or indentation on the right side to accommodate the wall's protrusion.
* Evaluation: * Option A: Has a hexagon sticking out to the right below the top. This would collide with the wall. * Option B: Also has a "shoulder" block sticking out to the right. This would collide. * Option C: Has a straight vertical right edge, which places a block where the wall protrudes. This would likely collide. * Option D: Below the top hexagon, the right side is empty (creating a narrow "neck"). This perfectly accommodates the jutting wall on the right.
3. The Left Flank (The "Arm"): Examine the left side boundary of the hole. There is a distinct "C"-shaped cutout in the wall. Specifically, below the top section, the wall recedes to the left, creating a deep pocket or void. * Constraint: The missing piece must have a protrusion or "arm" extending to the left to fill this pocket.
* Evaluation: * Option C: Is flat on the left side and lacks this protrusion. * Option B: Is relatively compact and doesn't have a distinct arm reaching out far enough to the left compared to its vertical axis. * Option D: Features a prominent hexagon sticking out to the left, which fits the "pocket" in the hole's left wall perfectly.
Conclusion: * Option D matches all the specific contours of the empty space. * It has the single top hexagon ("Head"). * It has the gap on the right side ("Neck") to avoid the wall protrusion. * It has the protrusion on the left side ("Arm") to fill the wall pocket. * It has the split legs at the bottom to fit over the base block. Therefore, the correct piece is D.
Model Answer: D
Ground Truth: B

Challenge 2: Manifold Understanding

Trash Bin Maze
Visual Tracking / Connect the Lines
Into which color bin would "plastic cup," "newspaper," and "apple core" go respectively in the picture? ("plastic cup," "newspaper," and "apple core" are A, B, C respectively; answer format is like: A-yellow, B-blue)
Answer: A-blue, B-yellow, C-green

Another failure mode we observe is loss of manifold identity: MLLMs struggle to maintain the continuous identity of even a thin curve. When the answer is encoded in connectivity—not in object semantics—models often degrade from "following a line" to "guessing an endpoint."

Humans solve such tasks by visual tracking: they "lock onto" one curve and continuously follow it through crossings until it terminates. This is an early-acquired visual routine—the perceptual system performs contour integration and maintains "which line I am on" through intersections almost automatically, without naming intermediate steps.

For MLLMs, the core difficulty is that the answer is encoded in the connectivity of a 1D manifold embedded in 2D—a continuous curve that winds, overlaps, and self-intersects. The model tries to translate the curve into discrete instructions (left/right/up/down), but crossings create combinatorial branching. Without a faithful, persistent representation of the curve, the model easily "switches tracks" after a crossing—an error visually obvious to humans but difficult to detect once compressed into words.

Core weakness: MLLMs do not reliably maintain perceptual identity across extended spatial trajectories. Success depends on robust contour integration, continuity-preserving tracking, and resistance to interference from nearby curves—capabilities humans acquire effortlessly in early childhood.
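The connectivity computation itself is simple once a faithful spatial representation exists. Below is a minimal sketch under the simplifying assumption that each curve is drawn in its own color, so "staying on the same line" reduces to a flood fill over same-colored pixels; handling crossings between same-colored curves would require a direction-preserving tracker, which is exactly where models break down. The label image and endpoint names are hypothetical:

```python
from collections import deque
import numpy as np

def connected_endpoint(img: np.ndarray, start, endpoints: dict):
    """Follow the curve through `start` and report which labeled endpoint it reaches.

    `img` is an integer label image: background is 0 and every curve has its own id,
    so identity along the curve is preserved by a flood fill over the start color.
    """
    color = img[start]
    seen, queue = {start}, deque([start])
    targets = {pos: name for name, pos in endpoints.items()}
    while queue:
        y, x = queue.popleft()
        if (y, x) in targets and (y, x) != start:
            return targets[(y, x)]
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < img.shape[0] and 0 <= nx < img.shape[1]
                    and (ny, nx) not in seen and img[ny, nx] == color):
                seen.add((ny, nx))
                queue.append((ny, nx))
    return None

# Toy label image: curve 1 winds from the top-left to the bottom-right endpoint.
# Curve 2's vertical line is interrupted at the crossing (row 2, col 4), where curve 1 passes over it.
img = np.array([
    [1, 1, 1, 0, 2, 0],
    [0, 0, 1, 0, 2, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 2, 1],
    [0, 0, 0, 0, 2, 1],
])
print(connected_endpoint(img, (0, 0), {"green bin": (4, 5), "blue bin": (4, 4)}))  # green bin
```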

Gemini3-Pro-Preview
Wrong observations are highlighted in red
Based on a careful tracing of the lines in the maze: 1. Plastic Cup (Item A): • Start at the dot below the plastic cup.
• Follow the line to the left. It curves down and forms the outermost path on the left side of the maze.
• It continues down to the bottom-left corner and turns right. • It runs horizontally along the very bottom edge of the puzzle, passing underneath the connections for the yellow and blue bins. • Finally, it curves up to connect to the dot above the green bin on the right. • Result: A-green 2. Apple Core (Item C): • Start at the dot below the apple core. • The line moves left, traversing the top edge of the maze. • It reaches the top-left corner and turns down, running vertically parallel to the cup's line (it is the second line from the left). • At the bottom of this vertical drop, it curves into the blue bin in the middle. • Result: C-blue 3. Old Newspaper (Item B): • Start at the dot below the newspaper. • The line goes left and immediately winds downwards through the center of the maze. • By elimination (and verifying the trace), it connects to the remaining bin. Tracing from the yellow bin (leftmost bin) upwards: the line goes right, loops up and left across the middle, then up and right to connect to the newspaper. • Result: B-yellow Answer: \boxed{A-green, B-yellow, C-blue}
Model Answer: A-green, B-yellow, C-blue
Ground Truth: A-blue, B-yellow, C-green

Challenge 3: Spatial Imagination

3D View
Spatial Perception / 3D Views
Which is the correct view when looking at the 3D structure from the direction indicated by the arrow?
Answer: A

A third pervasive bottleneck we observe is spatial imagination: the ability to construct a stable internal 3D representation from a 2D depiction, then mentally transform it (change viewpoint, project to silhouette, infer hidden volume) while preserving structural consistency. This skill is fundamental to human vision—children develop it early through play with blocks, drawings, and everyday navigation.

Humans solve such tasks by a brief act of imagination: they mentally view the object from the indicated direction and simply count or compare. Importantly, this is not a verbal process—people do not enumerate every element in language; they just hold the image in their mind and reason directly.

MLLMs, by contrast, translate the visual scene into a language summary before reasoning: (1) Approximate the viewpoint ("arrow points from lower right"), (2) Describe the object in words ("Analyze the Structure's Dimensions (Grid and Heights)"), (3) Guess the 2D features from coarse descriptions. The breakdown is that narration is not a faithful spatial state—once the precise image is compressed into a vague text summary, the model makes predictable errors: missing hidden blocks, miscounting layers, or applying wrong 3D projections.

Core weakness: MLLMs do not reliably "imagine" the 3D object. Spatial imagination—the ability to preserve structure while transforming it—is an ability humans acquire early through perception and interaction, but current MLLMs still rely on language logic, which is a poor substitute for simply holding a shape in mind.
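The mental operation being asked for can be written down explicitly for block-stack puzzles: represent the structure as a height map and project it, so that each column of the silhouette is the tallest stack along the viewing axis. A minimal sketch with made-up heights (not the actual puzzle's layout):

```python
import numpy as np

# Height map of a toy block structure: heights[r, c] = cubes stacked at grid cell
# (row r = depth away from the viewer, column c = left-to-right). Illustrative values only.
heights = np.array([
    [4, 4, 0, 0],   # back row: a tall two-column wall
    [3, 3, 3, 2],   # front row: a lower platform that steps down on the right
])

def front_silhouette(heights: np.ndarray) -> np.ndarray:
    """Outline seen when looking along the depth axis: in each left-to-right column,
    only the tallest stack in that depth slice is visible."""
    return heights.max(axis=0)

# The puzzle then reduces to comparing this 1D profile against the drawn options.
print(front_silhouette(heights))  # [4 4 3 2]
```

Humans perform this projection implicitly and in parallel; the failure mode above is not the arithmetic but constructing and holding the height map itself from a 2D drawing.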

Gemini3-Pro-Preview
Wrong observations are highlighted in red
To determine the correct view, let's analyze the 3D structure and the viewing direction indicated by the arrow. 1. Analyze the Viewing Direction: The red arrow originates from the bottom right and points towards the object. In technical drawing conventions for isometric projections: • This represents the Right Side View. • We need to project the object onto a 2D plane perpendicular to this viewing direction. • In this projected view: • The horizontal axis will represent the depth of the object, running from the "front" corner (closest to the viewer's left side) to the "back-right" corner (closest to the viewer's right side). • The vertical axis will represent the height of the block stacks. • Essentially, for every column from left to right in the view, we see the maximum height of the blocks in that "slice" of the object. 2. Analyze the Structure's Dimensions (Grid and Heights): Let's break down the object into a grid of columns. Based on the visual cues (vertical lines and proportions): • The grid appears to be 4 units wide along the viewing axis. Let's call these Column 1 to Column 4, from Left to Right in our 2D view. • Let's estimate heights in units of single blocks: • The Tower: The tallest part is the "L-shaped" wall at the back-left corner. Let's assign this a height of 4 units. This corresponds to the tallest columns in the options. • The Seat: The flat platform in front of the tower appears to be one unit lower than the tower top. Let's assign this a height of 3 units. (Alternatively, if the drop is 2 units, the seat is height 2, but let's look at the steps). • The Step: The rightmost column is one step lower than the seat. If the seat is 3, this is 2 units. 3. Determine the Height of Each Projected Column:Column 1 (Leftmost in view): • This corresponds to the "front" slice of the object (the corner closest to the bottom of the image). • Visible in this slice is the front of the seat stack (Height 3). • However, directly behind this stack (from the arrow's perspective) stands the tall tower. • Since the tower is taller (Height 4), it will be visible rising above the seat. • Therefore, the silhouette for this column will have a height of 4. • Column 2: • This corresponds to the next slice to the right. • Similar to the first column, there is a seat stack in front (Height 3). • Behind it, the tower continues (the tower is 2 blocks wide). The tower's height is 4. • Therefore, the silhouette for this column will also have a height of 4. • Column 3: • This corresponds to the next slice. • Here we see the seat stack (Height 3). • Is there anything behind it? Looking at the inner corner of the "L", the tall tower wall ends after the second column. There is no tall wall behind column 3. • Therefore, the maximum visible height is just the seat itself, which is 3. • Column 4 (Rightmost in view): • This corresponds to the far right slice. • This stack is clearly stepped down from the main seat level.
• Its height is 2. • There is nothing taller behind it. • Therefore, the silhouette height is 2.
4. Match with Options: We are looking for a 2D histogram-like shape with column heights (from left to right): 4, 4, 3, 2. • Option A: Shows relative heights High, High, Low, High (4, 4, 2, 4). The last column is tall, which contradicts the object's shape (the right end is the lowest point). • Option B: Shows relative heights Medium, High, Low, Low (3, 4, 2, 2). The first column is shown as shorter than the second. This is incorrect because the tall tower stands behind the first column, making its silhouette full height. • Option C: Shows relative heights High, High, Medium, Low (4, 4, 3, 2). This matches our analysis perfectly: two tall columns for the tower section, stepping down to the seat, and stepping down again to the end. Therefore, the correct view is C.
Model Answer: C
Ground Truth: A

Challenge 4: Visual Pattern Induction

Pattern Completion
Pattern Recognition / Visual Reasoning
Among the three given options, choose the most suitable one to fill in the question mark so that the sequence follows a consistent pattern.
Answer: C

A fourth challenge we repeatedly observe is visual pattern induction: the ability to abstract a generalized transformation rule from a few visual examples and apply it to a new input. Humans typically handle such problems by comparing visual example pairs directly, constructing a small causal graph: which shape contains which, which element is the frame, and how these roles are reassigned from input to output.

The key human ability is to see relational rules (what changed) rather than object attributes (what is there). Whether the pattern involves rotation, swap, reflection, or containment—humans extract the abstract transformation and apply it to novel inputs. The specific shapes, colors, or positions do not matter; only their roles in the transformation do.

MLLMs, by contrast, approach such problems through attribute counting rather than relational mapping. Instead of seeing an abstract operation, they rely on semantic description: describe the source, describe the target, attempt to bridge via text. This approach fails because the model often hallucinates rules based on surface features (e.g., "...two green, two brown, and four orange segments") rather than structural logic. The model focuses on objects as fixed entities rather than elements in a transformation sequence.

Core weakness: MLLMs often mix up appearance with structure. Pattern induction requires ignoring specific visual elements to see the abstract pattern. Success in these tasks requires abstract reasoning over visual relations—a step beyond simple recognition that remains a significant hurdle for current architectures.
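The relational rule in the octagon example can be stated as a tiny program: encode each figure as a ring of segment colors, infer the circular shift between consecutive examples, check that it is consistent, and apply it once more. A minimal sketch (the ring encoding and color values are illustrative, only eyeballed from the figure):

```python
import numpy as np

def infer_shift(a: np.ndarray, b: np.ndarray) -> int:
    """Circular shift k such that rolling ring `a` by k positions yields ring `b`."""
    for k in range(len(a)):
        if np.array_equal(np.roll(a, k), b):
            return k
    raise ValueError("the two rings are not related by a pure rotation")

# Each octagon as a ring of 8 segments: 0 = orange, 1 = green, 2 = brown (illustrative).
ring1 = np.array([1, 1, 2, 2, 0, 0, 0, 0])
ring2 = np.roll(ring1, 2)                    # second example: same ring, shifted
ring3 = np.roll(ring2, 2)                    # third example: shifted again

step = infer_shift(ring1, ring2)
assert infer_shift(ring2, ring3) == step     # the rule is consistent across example pairs
prediction = np.roll(ring3, step)            # apply the induced rule to the last figure
print(step, prediction)                      # 2 [2 2 0 0 0 0 1 1]
```

The point is that the rule lives in the relation between figures rather than in any single figure's attributes, which is exactly the abstraction that attribute-counting descriptions miss.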

Qwen3-VL-Plus
Wrong observations are highlighted in red
To determine the correct answer, we analyze the pattern in the sequence of three octagons provided at the top of the image. Each octagon is composed of 8 equal segments, with two green, two brown, and four orange segments. Step 1: Identify the Movement Pattern We focus on the brown segments, which are always two adjacent segments. Observing their positions across the three figures: • First octagon: Brown segments are on the right side (positions 3–4). • Second octagon: Brown segments have moved to the bottom (positions 5–6). • Third octagon: Brown segments are now on the left side (positions 7–8). This indicates that the brown block moves two positions counterclockwise with each step. Step 2: Predict the Next Position Following this pattern: • From right (3–4)bottom (5–6)left (7–8), • The next logical position is top (1–2). So, the next figure should have the brown segments at the top (two adjacent segments). Step 3: Evaluate the Options Now, we examine the answer choices to find the one where the brown segments are at the top (i.e., two adjacent segments at the top of the octagon): • Option A: Brown segments are at the top right and bottom — not at the top. • Option B: Brown segments are at the top (two adjacent), and green is on the left — this matches the expected pattern. • Option C: Brown segments are at the top left and bottom — again, not fully at the top. Only Option B has the brown segments correctly placed at the top, consistent with the observed movement pattern. Conclusion The pattern shows the brown segments moving two positions counterclockwise in each step. Based on this, the next figure should have the brown segments at the top, which is only true for Option B. [B]
Model Answer: B
Ground Truth: C

Insights from Training: How to Achieve Better Results on BabyVision

We have identified large performance gaps on BabyVision, not only between humans and frontier models but also between closed- and open-source models. This raises a natural follow-up question: how can we develop stronger visual reasoning skills and achieve better scores on BabyVision with open models?

As Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated strong gains in language-reasoning performance for LLMs, we conduct a preliminary study to investigate whether RLVR can similarly improve the visual abilities measured by BabyVision. We use Qwen3-VL-8B-Thinking as the base model and apply RLVR fine-tuning. For data collection, we adopt a BabyVision-style pipeline but draw from larger image sources and remove duplicates, yielding 1,400 training examples. The collected data covers all four major BabyVision task families, yet its difficulty distribution is not fully aligned with BabyVision: the same base model achieves 34.2% initial accuracy on the RLVR training set but only 13.1% on BabyVision.

We fine-tune Qwen3-VL-8B-Thinking for 450 steps using the GRPO algorithm. We observe that RLVR is effective on the collected training dataset: both training accuracy and held-out test accuracy consistently improve over the course of training.
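For concreteness, the RLVR setup reduces each rollout to a binary correctness reward from a verifier, and GRPO converts those rewards into group-relative advantages by normalizing within the group of rollouts sampled for the same question. A minimal sketch of that step (illustrative; not the actual training code or verifier):

```python
import numpy as np

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward from a simple exact-match verifier."""
    return 1.0 if model_answer.strip().lower() == ground_truth.strip().lower() else 0.0

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize rewards by the mean and std of the
    rollout group sampled for the same question."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Eight rollouts sampled for one multiple-choice question whose answer is "B".
answers = ["B", "D", "B", "A", "D", "D", "C", "D"]
rewards = np.array([verifiable_reward(a, "B") for a in answers])
print(rewards)                   # [1. 0. 1. 0. 0. 0. 0. 0.]
print(grpo_advantages(rewards))  # correct rollouts get positive advantage, the rest negative
```

The update then favors rollouts with positive advantage, which is one reason gains concentrate on tasks whose solutions can be narrated in thinking tokens, as the visual-tracking results below suggest.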

Training curves: training and held-out test accuracy during RLVR fine-tuning.

The BabyVision performance of Qwen3-VL-8B-Thinking before and after RL fine-tuning is reported in the following table. The model achieves a +4.8-point overall accuracy improvement after RLVR training. We also observe consistent gains across most task subtypes, with the sole exception of visual tracking, for which RL fine-tuning yields little to even negative improvement. We hypothesize that this is because visual tracking is the least amenable to verbalization; since RLVR primarily enhances performance by encouraging longer and more structured “thinking-token” reasoning, it provides less benefit on tasks that depend on continuous perceptual tracking rather than language-mediated reasoning.

Category                        Before RL (%)   After RL (%)   Improvement
Overall                             13.1            17.9           +4.8
Fine-grained Discrimination         12.7            19.4           +6.8
Visual Tracking                     10.8             9.6           -1.2
Spatial Perception                  15.0            20.9           +5.9
Visual Pattern Recognition          15.0            20.9           +5.9

Beyond VLMs: Can Generation Help Reasoning?

If text-based reasoning proves insufficient, a natural question arises: can visual generation bridge this gap? Rather than describing solutions in words, could models draw the answer—mirroring how children intuitively point to, trace, or mark solutions when reasoning visually?

Motivated by this insight, we introduce BabyVision-Gen, a generative extension of BabyVision that evaluates whether image- and video-generation models can perform visual reasoning through visual outputs. BabyVision-Gen comprises 280 questions re-annotated from the original benchmark to support generation-based evaluation, where correctness can be directly and unambiguously verified by comparing the model-generated outputs with human-drawn ground-truth solutions. We also develop an automatic evaluation tool for generation models that achieves a 0.96 agreement with human evaluators.
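The reported agreement is a fraction of matching verdicts between the automatic judge and human evaluators over the same generated outputs. A minimal sketch of how such a number is computed (hypothetical verdicts; the blog does not specify whether a chance-corrected statistic is also used):

```python
def agreement_rate(auto_verdicts: list, human_verdicts: list) -> float:
    """Fraction of generated outputs on which the automatic judge and the human
    evaluator give the same correct/incorrect verdict."""
    assert len(auto_verdicts) == len(human_verdicts)
    matches = sum(a == h for a, h in zip(auto_verdicts, human_verdicts))
    return matches / len(auto_verdicts)

# Hypothetical verdicts on ten generated answers (True = judged correct).
auto  = [True, False, False, True, False, True, False, False, True, False]
human = [True, False, False, True, False, True, False, True,  True, False]
print(agreement_rate(auto, human))  # 0.9
```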

Generation Models Show a Promising Direction

Naturally, generative models introduce a new test-time scaling dimension for visual reasoning, enabling explicit image manipulation, intermediate markups, and hint drawing during the reasoning process. We evaluate several state-of-the-art image and video generation models on BabyVision-Gen, including Nano-Banana-Pro, GPT-1.5-Image, Qwen-Image-Edit-2511, Veo-3, and Sora-2.

Category                        Nano-Banana-Pro   GPT-1.5-Image   Qwen-Image-Edit-2511
Overall                               18.3              9.8                4.8
Fine-grained Discrimination           24.5              9.6                4.7
Visual Tracking                        6.7              2.4                0.0
Spatial Perception                    13.0             12.4                7.3
Visual Pattern Recognition            22.8             16.7                7.9

Our results suggest that video generation could serve as a new paradigm for multimodal reasoning on tasks that remain challenging for vision–language models (VLMs), as illustrated below. However, despite these encouraging behaviors, current generative models still struggle to consistently arrive at fully correct solutions, as the performance table above shows. Nevertheless, these findings point to a compelling direction, echoing "Video models are zero-shot learners and reasoners": video generation models hold strong potential to evolve into well-rounded multimodal reasoners, especially when visual reasoning is grounded in explicit visual manipulation rather than language alone.

Task: Draw a red line to trace the complete line extending from the top left figure.

Sora-2:

NanoBanana-Pro:

NanoBanana-Pro output

From the generated outputs in the visual tracking task, we observe that these two models exhibit the most human-like visual thinking processes, explicitly drawing trajectories along the paths in the image. However, despite this alignment with human behavior, their generations still contain noticeable errors, indicating that further alignment is required.


Cases for Generative Visual Reasoning

Below we compare multiple generation models on BabyVision-Gen tasks:

Fine-grained Discrimination / Find the Different
Which pattern in the figure differs in shape from the others? (H vs 4)
Put a red circle on the unique element in the picture
Gallery: input, ground truth, and outputs from NanoBanana-Pro, Qwen-Image-Edit-2511, GPT-Image-1.5, Sora-2, Veo3, and Wan2.2.
Visual Tracking / Maze
The diagram is a maze; find which entrance connects to the exit.
The entrance at A is the one that connects to the exit. Please draw red lines showing the way through the maze from the entrance to the exit.
Gallery: input, ground truth, and outputs from the same models.
Visual Tracking / Connect the Lines
Three animal patterns are above, three environmental patterns below. Which animal goes where?
Draw a red line to trace the complete line extending from the top left figure.
Gallery: input, ground truth, and outputs from the same models.
More Examples
Visual Tracking / Recognize Numbers
Four uppercase letters or Arabic numerals in different colors. Which are they?
The four blanks below are traced in different colors. Fill in each blank, in black, with the letter or number whose color matches that blank.
Gallery: input, ground truth, and outputs from the same models.
Spatial Perception / 3D Views
From the three options, select the top view of the solid figure.
Put a red circle on the option corresponding to the top view.
Gallery: input, ground truth, and outputs from the same models.
Spatial Perception / 3D Views
Connect the three-dimensional figure with its corresponding top view.
Use red lines to connect the solid figures at the top to their corresponding top views at the bottom.
Gallery: input, ground truth, and outputs from the same models.
Visual Tracking / Metro Map
How many stations minimum must be passed through when traveling from one station to another?
Use a black line to trace the shortest path between the two stations.
Gallery: input, ground truth, and outputs from the same models.
Fine-grained Discrimination / Find the Shadow
Find the shadow that exactly matches the middle penguin pattern.
Put a red circle on the shadow that exactly matches the penguin.
Gallery: input, ground truth, and outputs from the same models.
Fine-grained Discrimination / 2D Pattern Completion
From the given three options, select the most appropriate one to fill in the blank.
Put a red circle on the option that completes the pattern.
Gallery: input, ground truth, and outputs from the same models.

Conclusion

BabyVision reveals a striking truth: current MLLMs lack robust foundational visual competence, even when compared to young children. Despite their impressive performance on language-heavy and expert-level benchmarks, today's MLLMs still struggle with the pre-linguistic visual primitives that humans acquire in early childhood.

By decomposing visual intelligence into atomic capabilities and benchmarking them independently of language, BabyVision exposes where current models fall short and why scaling language alone is insufficient. Our results further suggest that visual generation—reasoning by drawing, tracing, and manipulating images—offers a promising path forward, partially recovering capabilities that text-based reasoning cannot express.

These atomic visual abilities are also critical for embodied AI: it is difficult to imagine a robot with visual competence below that of a three-year-old reliably assisting humans in the physical world. BabyVision provides both a diagnostic lens and a research direction: to advance multimodal intelligence, future models must rebuild vision from the ground up rather than rely on linguistic shortcuts.

Acknowledgements

We would like to thank Xiaotao Gu (Zhipu AI), Junyang Lin (Alibaba Group), Shuai Bai (Alibaba Group), and Shuhuai Ren (Xiaomi MiMo) for their valuable discussions and insightful feedback throughout this project.

Citation

For details of BabyVision, please read our paper. If you find it useful in your research, please kindly cite:

@article{babyvision2026,
  title={BabyVision: Visual Reasoning Beyond Language},
  year={2026}
}