Last update: Monday 8/11/25
When OpenAI released its GPT-5 model last week, CEO Sam Altman boasted about GPT-5's scores on benchmark tests. To place those boasts in a broader context, the editor of this blog constructed a "Leaderboard" comparing GPT-5's benchmark scores with the scores of other leading genAI models.
... Skip to Leaderboard ➡ HERE
Development process
The editor engaged ChatGPT running GPT-5 to generate the first draft, which he submitted to Claude Sonnet 4 for corrections and suggested modifications. When ChatGPT completed its second draft, he asked Claude for another round of corrections and suggestions. The editor's own contributions focused on the consistency and readability of the drafts. The resulting draft was comprehensive, impressive ... and unreadable by humans.
The editor then "vibe coded" a spec asking that the table "pop out" into its own full-screen window so that its data would not be "scrunched up" into multi-line, hyphenated entries. To his delight, ChatGPT came up with some obscure CSS that delivered a scrollable pop-out table that was readable on a smartphone.
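The post does not reproduce that CSS, but for curious readers here is a minimal sketch of the general technique: a fixed, full-screen overlay whose inner container scrolls in both directions so wide rows never wrap. The class names (.table-overlay, .table-scroll) are invented for this illustration and are not the code ChatGPT actually produced.

```html
<!-- Minimal sketch of a full-screen, scrollable table overlay.
     Class names are hypothetical; this is not the CSS ChatGPT generated. -->
<style>
  .table-overlay {
    position: fixed;      /* hover over the blog page */
    inset: 0;             /* fill the viewport */
    background: rgba(0, 0, 0, 0.5);
    display: flex;
    align-items: center;
    justify-content: center;
    z-index: 1000;
  }
  .table-scroll {
    background: #fff;
    max-width: 95vw;
    max-height: 90vh;
    overflow: auto;       /* scroll both axes instead of wrapping cells */
    padding: 1em;
  }
  .table-scroll table {
    border-collapse: collapse;
    white-space: nowrap;  /* keep each cell on one line; no hyphenated wrapping */
  }
  .table-scroll th,
  .table-scroll td {
    border: 1px solid #ccc;
    padding: 0.4em 0.8em;
  }
</style>

<div class="table-overlay">
  <div class="table-scroll">
    <table>
      <!-- leaderboard rows go here -->
    </table>
  </div>
</div>
```

Dismissing the overlay by clicking outside the table, as the live pop-out does, would take a few extra lines of JavaScript that are omitted from this sketch.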
Indeed, the table was so clear that the editor easily spotted a few more glaring inconsistencies and probable errors. One last round of corrections and modifications from Claude yielded the final version of the table.
Our readers are strongly advised to click the link that displays a scrollable copy of the table in a window that hovers over the blog page. We welcome your corrections and suggested modifications.
With a little help from our readers and from our chatbot "assistants", we will endeavor to keep this leaderboard up-to-date... 😎
Click HERE to view a more readable, scrollable, "unscrunched" table in its own window
LLM Benchmarks Leaderboard ... Only verified scores
Last updated: Aug 10, 2025 • Scores are % unless noted; “N/R” = not reported • Podium coloring: green = 1st, yellow = 2nd, pink = 3rd; no color below 3rd.
Category | Benchmark | Protocol / Measure | GPT-5 | Claude 4.1 | Gemini 2.5 Pro | Grok 4 | DeepSeek R1 | Top-3 |
---|---|---|---|---|---|---|---|---|
Core Reasoning & Math | AIME 2025 | no tools | 94.6 | 75.5 | 88.0 | N/R | 87.5 | 1) GPT-5 • 2) Gemini • 3) DeepSeek |
Core Reasoning & Math | GPQA Diamond | pass@1 | 89.4 | 80.9 | 86.4 | 88.9 | 81.0 | 1) GPT-5 • 2) Grok • 3) Gemini |
Core Reasoning & Math | ARC-AGI-2 | abstract reasoning | 9.9 | 8.6 | 15.9 | 15.9 | N/R | 1) Gemini/Grok (tie) • 3) GPT-5 |
Core Reasoning & Math | Humanity’s Last Exam | no tools | 42.0 (Pro) | 11.5 | 21.64 | 44.4 | 14.04 | 1) Grok • 2) GPT-5 • 3) Gemini |
Language Understanding | MMLU | 5-shot | 90.2 | 85.8 | 85.8 | N/R | N/R | 1) GPT-5 • 2) Claude/Gemini (tie) |
Coding | SWE-bench Verified | single attempt | 74.9 | 74.5 | 59.6 | N/R | N/R | 1) GPT-5 • 2) Claude • 3) Gemini |
Coding | SWE-bench Verified | multi attempt | N/R | 79.4 | 67.2 | N/R | 57.6 | 1) Claude • 2) Gemini • 3) DeepSeek |
Coding | Aider Polyglot | pass@1 | 88.0 | 72.0 | 82.2 | N/R | 71.6 | 1) GPT-5 • 2) Gemini • 3) Claude |
Coding | LiveCodeBench | Jan–May 2025 | 72.0 | 51.1 | 69.0 | N/R | 70.5 | 1) GPT-5 • 2) DeepSeek • 3) Gemini |
Multimodal | MMMU | single attempt | 84.2 | 76.5 | 82.0 | N/R | N/R | 1) GPT-5 • 2) Gemini • 3) Claude |
Multimodal | VideoMME | video understanding | N/R | N/R | 84.8 | N/R | N/R | 1) Gemini • 2) — • 3) — |
Agent & Tool Use | Tau-bench (airline) | agent navigation | 63.5 | N/R | N/R | N/R | N/R | 1) OpenAI o3 • 2) GPT-5 • 3) — |
Agent & Tool Use | Tau-bench (retail) | agent navigation | 81.1 | 82.4 | N/R | N/R | N/R | 1) Claude • 2) GPT-5 • 3) — |
Specialized | HealthBench (Hard) | pass@1 | 46.2 | N/R | N/R | N/R | N/R | 1) GPT-5 • 2) — • 3) — |
Long-Context & Factuality | MRCR v2 | 128k avg / 1M pointwise | N/R | N/R | 58.0 / 16.4 | N/R | N/R | 1) Gemini (reported) |
Long-Context & Factuality | SimpleQA | pass@1 | N/R | N/R | 54.0 | N/R | N/R | 1) Gemini |
Long-Context & Factuality | FACTS grounding | grounding score | N/R | 77.7 | 87.8 | N/R | N/R | 1) Gemini • 2) Claude |
Safety & Robustness | StrongREJECT (jailbreak) | lower is better | N/R | 2.24–6.71% | N/R | N/R | N/R | 1) Claude |
Safety & Robustness | BBQ (bias) | bias ↓ / accuracy ↑ | N/R | 0.21 / 99.8 | N/R | N/R | N/R | 1) Claude |
Sources
- Humanity’s Last Exam leaderboard: [Wikipedia](https://en.wikipedia.org/wiki/Humanity%27s_Last_Exam?utm_source=chatgpt.com)
- GPT-5 details & ARC-AGI-2 mention: [Tom's Guide](https://www.tomsguide.com/news/live/openai-chatgpt-5-live-blog?utm_source=chatgpt.com), [For Every Scale](https://www.foreveryscale.com/p/gpt-5-vs-rivals-the-real-score?utm_source=chatgpt.com)
- Grok 4 ARC-AGI-1 results: [THE DECODER](https://the-decoder.com/grok-4-edges-out-gpt-5-in-complex-reasoning-benchmark-arc-agi/?utm_source=chatgpt.com)
- GPT-5 vs Gemini deep prompting takeaways: [Tom's Guide](https://www.tomsguide.com/ai/chatgpt/i-tested-chatgpt-vs-gemini-2-5-pro-with-these-3-prompts-and-it-shows-what-gpt-5-needs-to-do?utm_source=chatgpt.com)
- GSM8K benchmark info: [klu.ai](https://klu.ai/glossary/GSM8K-eval?utm_source=chatgpt.com)
- ARC-AGI-2 benchmark paper: [arxiv.org](https://arxiv.org/abs/2505.11831?utm_source=chatgpt.com)
- System card for Claude Opus 4.1
- System card for GPT-5
Your comments will be greatly appreciated ... Or just click the "Like" button above the comments section if you enjoyed this blog note.