AI Model Leaderboard

Compare AI models across agentic, reasoning, coding, and tool-use benchmarks.


Benchmarks

agentic · reasoning · coding · tool-use · computer-use

METR Time Horizon

Measures how long AI agents can work autonomously on real software tasks without human intervention. This is a distinct capability axis: it tracks how the frontier of autonomous AI work duration moves over time.

Agentic · 9 models · minutes

1. Claude Opus 4.6 · Undisclosed

718.80 minutes

2. GPT-5.4 · Undisclosed

352.20 minutes

3. Gemini 3 Pro · Undisclosed

224.30 minutes

4. o3 · Undisclosed

119.70 minutes

5. Claude Sonnet 4.6 · Undisclosed