AI Model Leaderboard

Compare AI models across agentic, reasoning, coding, and tool-use benchmarks.


Benchmarks: agentic, reasoning, coding, tool-use, computer-use

GAIA

466 human-curated, multi-step tasks that require web browsing, tool use, and file parsing. Widely treated as a reference benchmark for real-world agent capability.

Agentic · 49 models · % accuracy

Rank  Model                               Parameters   Accuracy
1     GPT-5.4                             Undisclosed  92.36%
2     Llama-3.1-Nemotron-70B-Instruct-HF  70B          87.38%
3     Gemini 2.5 Pro                      Undisclosed  87.04%
4     Claude Sonnet 4.6                   Undisclosed  83.72%
5     GPT-4.1                             Undisclosed  83.06%
6     Llama-3.1-70B-Instruct              70B          -
-     Gemini 3 Pro                        Undisclosed  78.07%
10    Llama-3.3-70B-Instruct              70B          -
-     Claude Opus 4.6                     Undisclosed  74.09%
16    Gemini 1.5 Pro                      Undisclosed  73.09%
17    DeepSeek Reasoner (V3.2)            671B         -
-     Qwen2.5-72B-Instruct                72B          -
-     Gemini 2.5 Flash                    Undisclosed  58.80%
24    DeepSeek-R1-Distill-Qwen-32B        32B          57.48%
25    Grok 4.20                           Undisclosed  55.48%
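
The "% accuracy" figures can be read as the share of benchmark tasks a model answered correctly. The sketch below is a minimal illustration of that arithmetic, assuming a plain correct/total ratio rounded to two decimals. The function name and the counts are hypothetical; the leaderboard does not state how many tasks each model was scored on.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of correctly solved tasks as a percentage (two decimals)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return round(100 * correct / total, 2)


# Hypothetical counts for illustration only, not taken from the leaderboard.
print(f"{accuracy_percent(233, 466):.2f}% accuracy")  # prints "50.00% accuracy"
```

Under this reading, a one-task difference on a 466-task set moves the score by roughly 0.21 percentage points, so small gaps between adjacent entries may reflect only a handful of tasks.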