AI Model Leaderboard

Compare AI models across agentic, reasoning, coding, and tool-use benchmarks.

Benchmarks

Categories: agentic, reasoning, coding, tool-use, computer-use

Open LLM Leaderboard v2

Hugging Face's leaderboard for open-source models. It evaluates models on IFEval, BBH, MATH Lvl 5, GPQA Diamond, MuSR, and MMLU-Pro, and is widely used as a reference for open-source model rankings.

Reasoning · 7 models · ranked by avg score

1. Qwen2.5-72B-Instruct · 72B · 47.98 avg score
2. Mistral-Large-Instruct-2411 · 123B · 46.52 avg score
3. Llama-3.3-70B-Instruct · 70B · 44.85 avg score
4. Qwen2-72B-Instruct · 72B · 43.59 avg score
5. Llama-3.1-70B-Instruct · 70B · 43.41 avg score
6. Qwen2.5-14B-Instruct · 14B · 41.31 avg score
7. Qwen2.5-Coder-32B-Instruct · 32B · 39.89 avg score
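
For readers who want to work with this listing programmatically, the following is a minimal Python sketch. It assumes the seven entries are copied by hand from the listing above. The `Entry` class and the `ranked` and `best_under` helpers are hypothetical names introduced for illustration and are not part of any leaderboard API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    model: str
    params_b: int     # parameter count in billions, as shown next to the model name
    avg_score: float  # the "avg score" value from the listing


# Values copied by hand from the Reasoning listing (not fetched from any API).
ENTRIES = [
    Entry("Qwen2.5-72B-Instruct", 72, 47.98),
    Entry("Mistral-Large-Instruct-2411", 123, 46.52),
    Entry("Llama-3.3-70B-Instruct", 70, 44.85),
    Entry("Qwen2-72B-Instruct", 72, 43.59),
    Entry("Llama-3.1-70B-Instruct", 70, 43.41),
    Entry("Qwen2.5-14B-Instruct", 14, 41.31),
    Entry("Qwen2.5-Coder-32B-Instruct", 32, 39.89),
]


def ranked(entries: List[Entry]) -> List[Entry]:
    """Return the entries sorted by avg score, highest first."""
    return sorted(entries, key=lambda e: e.avg_score, reverse=True)


def best_under(entries: List[Entry], max_params_b: int) -> Optional[Entry]:
    """Return the highest-scoring entry with at most max_params_b billion
    parameters, or None if no entry is small enough."""
    eligible = [e for e in entries if e.params_b <= max_params_b]
    return max(eligible, key=lambda e: e.avg_score, default=None)


if __name__ == "__main__":
    for rank, e in enumerate(ranked(ENTRIES), start=1):
        print(f"{rank}. {e.model} ({e.params_b}B): {e.avg_score:.2f}")
```

Filtering by size shows that rank does not track parameter count in this listing: calling `best_under(ENTRIES, 40)` selects Qwen2.5-14B-Instruct, which scores above the larger Qwen2.5-Coder-32B-Instruct.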