Benchmarks

BabyVision

🏆 Gemini3-Pro: 49.7% · Visual Reasoning

Performance Timeline

[Figure: scatter plot of model score (Performance, %, 0 to 100) against Release Date (Apr 2025 to Dec 2025) for the twelve models listed under Complete Benchmarks. Blue points are open models, orange points are proprietary models, and the dashed line marks the human baseline (94.1%).]

Can MLLMs See Like a 3-Year-Old?

Complete Benchmarks

Rank  Model                  Type         Company      Release Date  Score
-     Human (baseline)       Human        -            -             94.1%
1     Gemini3-Pro-Preview    Proprietary  Google       Nov 18, 2025  49.7%
2     GPT-5.2                Proprietary  OpenAI       Dec 11, 2025  34.4%
3     Doubao-1.8             Proprietary  ByteDance    Dec 18, 2025  30.2%
4     Qwen3VL-235B-Thinking  Open         Alibaba      Sep 23, 2025  22.2%
5     InternVL3.5-241B       Open         OpenGVLab    Aug 26, 2025  19.2%
5     Qwen3-VL-Plus          Proprietary  Alibaba      Sep 23, 2025  19.2%
7     GLM4.6V                Open         Zhipu AI     Dec 8, 2025   17.6%
8     Grok-4                 Proprietary  xAI          Jul 9, 2025   16.2%
9     MimoVL-7B-RL           Open         Xiaomi       Apr 23, 2025  15.1%
10    Step3                  Open         StepFun      Jul 31, 2025  14.7%
11    Claude-4.5-Opus        Proprietary  Anthropic    Nov 24, 2025  14.2%
12    KimiVL-A3B             Open         Moonshot AI  Dec 10, 2025  12.4%
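
The rank column appears to follow standard competition ranking: the two models tied at 19.2% share rank 5 and the next model is ranked 7. The sketch below reproduces that ordering and computes each model's gap to the human baseline. The scores are copied from the table; the function names `competition_ranks` and `gap_to_human` are illustrative, and the tie-handling rule is inferred from the table rather than stated by the benchmark authors.

```python
# Scores (%) copied from the leaderboard table above.
HUMAN_BASELINE = 94.1

SCORES = {
    "Gemini3-Pro-Preview": 49.7,
    "GPT-5.2": 34.4,
    "Doubao-1.8": 30.2,
    "Qwen3VL-235B-Thinking": 22.2,
    "InternVL3.5-241B": 19.2,
    "Qwen3-VL-Plus": 19.2,
    "GLM4.6V": 17.6,
    "Grok-4": 16.2,
    "MimoVL-7B-RL": 15.1,
    "Step3": 14.7,
    "Claude-4.5-Opus": 14.2,
    "KimiVL-A3B": 12.4,
}


def competition_ranks(scores):
    """Return {model: rank}. Tied scores share a rank and the following
    rank is skipped (1, 2, 3, 4, 5, 5, 7, ...). This rule is an assumption
    inferred from the table, not a documented scoring procedure."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (model, score) in enumerate(ordered, start=1):
        # Reuse the previous rank on a tie, otherwise use the 1-based position.
        rank = prev_rank if score == prev_score else position
        ranks[model] = rank
        prev_score, prev_rank = score, rank
    return ranks


def gap_to_human(scores, baseline=HUMAN_BASELINE):
    """Return {model: percentage points below the human baseline}."""
    return {model: round(baseline - score, 1) for model, score in scores.items()}


if __name__ == "__main__":
    ranks = competition_ranks(SCORES)
    gaps = gap_to_human(SCORES)
    for model in sorted(SCORES, key=lambda m: ranks[m]):
        print(f"{ranks[model]:>2}  {model:<22} {SCORES[model]:5.1f}%  gap {gaps[model]:4.1f} pts")
```

On these numbers, even the top-ranked model sits 44.4 percentage points below the human baseline.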

Citation

For details on BabyVision, please read our paper. If you find it useful in your research, please cite:

@article{babyvision2026,
  title={BabyVision: Visual Reasoning Beyond Language},
  year={2026}
}