AI Model Leaderboard

Compare AI models across agentic, reasoning, coding, tool-use, and computer-use benchmarks.

Benchmarks

agentic · reasoning · coding · tool-use · computer-use

SWE-bench Verified

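500 human-verified GitHub issues drawn from real open-source repositories. For each issue, an agent must write a code change that makes the repository's existing test suite pass end-to-end; the score is the percentage of issues resolved. A widely used reference benchmark for coding agents.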
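As a rough illustration of how a "% resolved" figure relates to the 500 issues, the sketch below computes the metric from a list of per-issue pass/fail flags. The function name `percent_resolved` and the example data are hypothetical and are not taken from the benchmark's own tooling; the only figures assumed are the 500-issue total and the two-decimal percentage format shown on this page.

```python
from typing import Iterable

# Number of issues in SWE-bench Verified, per the description above.
TOTAL_INSTANCES = 500


def percent_resolved(resolved_flags: Iterable[bool]) -> float:
    """Return the share of issues marked resolved, as a percentage.

    Hypothetical helper for illustration; each flag is True when the
    agent's change made the tests for that issue pass.
    """
    flags = list(resolved_flags)
    if not flags:
        raise ValueError("no instances to score")
    return 100.0 * sum(flags) / len(flags)


# Made-up run: 384 of the 500 issues resolved.
example_flags = [True] * 384 + [False] * (TOTAL_INSTANCES - 384)
print(f"{percent_resolved(example_flags):.2f}% resolved")  # prints "76.80% resolved"
```

Because the total is 500, each resolved issue moves the score by 0.20 percentage points, which is why scores on this benchmark land on multiples of 0.2.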

Coding · 12 models · % resolved

Claude Opus 4.6 · Undisclosed

76.80% resolved 2

Gemini 3 Flash · Undisclosed

Claude Sonnet 4.6 · Undisclosed

71.40% resolved 7

Kimi-K2-Instruct-0905 · 1T

Gemini 3 Pro · Undisclosed

69.60% resolved 11

Claude Haiku 4.5 · Undisclosed

66.60% resolved 12

GPT-5.4 Mini · Undisclosed

56.20% resolved