Compare AI models across agentic, reasoning, coding, and tool-use benchmarks.
agentic
reasoning
coding
tool-use
computer-use
Reproducing computational research papers end-to-end: coding, running experiments, and verifying results. The Hard difficulty requires a full reproduction with correct output. Scientific programming benchmark from HAL.