AI Model Leaderboard
Compare AI models across agentic, reasoning, coding, tool-use, and computer-use benchmarks.
Benchmarks
agentic
reasoning
coding
tool-use
computer-use
CORE-Bench Hard
Reproducing computational research papers end-to-end: writing code, running experiments, and verifying results. The Hard difficulty requires full reproduction with correct output. A scientific programming benchmark from HAL.
Coding · 2 models · % reproduced