AI Model Leaderboard

Compare AI models across agentic, reasoning, coding, and tool-use benchmarks.

Agentic · Reasoning · Coding · Tool Use · Computer Use

Benchmarks

Humanity's Last Exam

2,500+ expert-level questions across 100+ fields (math, science, humanities), crowd-sourced from domain experts. Billed as the hardest static benchmark and designed to be beyond the reach of current AI; the highest score in the listing below is 45%.
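Scores for this benchmark are reported as % accuracy, which is presumably the share of questions answered correctly. A minimal sketch of that arithmetic follows; the function name and the question counts are hypothetical, chosen only to illustrate the calculation, and are not taken from the leaderboard's own scoring code.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of correctly answered questions as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Hypothetical counts: 1,125 correct answers out of an assumed 2,500 questions.
score = accuracy_percent(1125, 2500)
print(f"{score:.2f}% accuracy")  # prints "45.00% accuracy"
```

Under these assumed counts, each percentage point corresponds to 25 questions, so small gaps between adjacent models reflect only a handful of answers.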

Reasoning · 19 models · % accuracy

Gemini 3.1 Pro Preview · Undisclosed · 45% accuracy · 2

Gemini 3 Pro · Undisclosed · 38.30% accuracy · 3

Claude Opus 4.6 · Undisclosed

Qwen3.6-Plus · Undisclosed

Gemini 2.5 Pro · Undisclosed · 21.60% accuracy · 13

gemma-4-31b · 30.7B · 19.50% accuracy · 14

GPT-5.4 Mini · Undisclosed · 19.40% accuracy · 15

Claude Sonnet 4.6 · Undisclosed · 13.70% accuracy · 16

Gemini 2.5 Flash · Undisclosed · 12.10% accuracy · 17

gemma-4-26b-a4b · 25.2B total / 3.8B active