Compare AI models on agentic, reasoning, coding, and tool-use benchmarks.
agentic
reasoning
coding
tool-use
computer-use
Reproducing computational research papers end to end: writing the code, running the experiments, and verifying the results. The hard difficulty level requires a full reproduction with correct output. A scientific programming benchmark from HAL.