Benchmarks

Last update: Monday, 8/11/25
When OpenAI released its GPT-5 model last week, its CEO Sam Altman boasted about GPT-5's scores on benchmark tests. To place Altman's boasts in a broader context, the editor of this blog constructed a "Leaderboard" of GPT-5's benchmark scores versus the scores of other leading genAI models.

... Skip to the Leaderboard HERE


Development process
The editor engaged ChatGPT (running GPT-5) to generate the first draft, which he submitted to Claude Sonnet 4 for corrections and suggested modifications. When ChatGPT completed its second draft, he asked Claude for another round of corrections and suggested modifications. The editor's own contributions focused on the consistency and readability of the drafts. That first complete draft was comprehensive, impressive ... and unreadable by humans.

The editor then "vibe coded" a spec asking that the table "pop out" into its own full-screen window so that its data would not be "scrunched up" into multi-line, hyphenated entries. To his delight, ChatGPT came up with some obscure CSS that delivered a scrollable pop-out table that was readable on a smartphone.
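
This blog note doesn't reproduce that code, but a minimal sketch of one way to get the effect might look like the snippet below. It is a hypothetical illustration, not the code ChatGPT generated: the popOutTable function, the element ID, the window dimensions, and the styling are all assumptions.

function popOutTable(tableId: string): void {
  // Hypothetical sketch: copy the leaderboard table into its own window,
  // with CSS that keeps every cell on one line so the reader scrolls
  // instead of squinting at wrapped, hyphenated entries.
  const table = document.getElementById(tableId);
  const win = window.open("", "_blank", "width=1200,height=800");
  if (!table || !win) return; // table missing or popup blocked
  win.document.write(`
    <style>
      body { margin: 8px; overflow: auto; }   /* scrollable on phones */
      table { border-collapse: collapse; }
      th, td {
        white-space: nowrap;                  /* no "scrunched" cells */
        padding: 4px 10px;
        border: 1px solid #ccc;
      }
    </style>
    ${table.outerHTML}`);
  win.document.close();
}

Wiring a sketch like this to the "Click HERE" link is then a one-line onclick handler.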

Indeed, the table was so clear that the editor easily spotted a few more glaring inconsistencies and probable errors. Claude's final corrections and modifications yielded the final version of the table.

Our readers are strongly advised to click the link that displays a scrollable copy of the table in a window that hovers over the blog page. We welcome your corrections and suggested modifications.

With a little help from our readers and from our chatbot "assistants", we will endeavor to keep this leaderboard up-to-date... 😎


Click HERE to view a more readable, scrollable, "unscrunched" table in its own window


LLM Benchmarks Leaderboard ... Only verified scores

Last updated: Aug 10, 2025 • Scores are percentages unless noted; “N/R” = not reported • Podium coloring: green = 1st, yellow = 2nd, pink = 3rd; no color below third place. A small code sketch of this ranking rule follows the table.

Category | Benchmark | Protocol / Measure | GPT-5 | Claude 4.1 | Gemini 2.5 Pro | Grok 4 | DeepSeek R1 | Top-3
Core Reasoning & Math | AIME 2025 | no tools | 94.6 | 75.5 | 88.0 | N/R | 87.5 | 1) GPT-5 • 2) Gemini • 3) DeepSeek
Core Reasoning & Math | GPQA Diamond | pass@1 | 89.4 | 80.9 | 86.4 | 88.9 | 81.0 | 1) GPT-5 • 2) Grok • 3) Gemini
Core Reasoning & Math | ARC-AGI-2 | abstract reasoning | 9.9 | 8.6 | 15.9 | 15.9 | N/R | 1) Gemini • 1) Grok • 3) GPT-5
Core Reasoning & Math | Humanity’s Last Exam | no tools | 42.0 (Pro) | 11.5 | 21.64 | 44.4 | 14.04 | 1) Grok • 2) GPT-5 • 3) Gemini
Language Understanding | MMLU | 5-shot | 90.2 | 85.8 | 85.8 | N/R | N/R | 1) GPT-5 • 2) Claude/Gemini (tie)
Coding | SWE-bench Verified | single attempt | 74.9 | 74.5 | 59.6 | N/R | N/R | 1) GPT-5 • 2) Claude • 3) Gemini
Coding | SWE-bench Verified | multi attempt | N/R | 79.4 | 67.2 | N/R | 57.6 | 1) Claude • 2) Gemini • 3) DeepSeek
Coding | Aider Polyglot | pass@1 | 88.0 | 72.0 | 82.2 | N/R | 71.6 | 1) GPT-5 • 2) Gemini • 3) Claude
Coding | LiveCodeBench | Jan–May 2025 | 72.0 | 51.1 | 69.0 | N/R | 70.5 | 1) GPT-5 • 2) DeepSeek • 3) Gemini
Multimodal | MMMU | single attempt | 84.2 | 76.5 | 82.0 | N/R | N/R | 1) GPT-5 • 2) Gemini • 3) Claude
Multimodal | VideoMME | video understanding | N/R | N/R | 84.8 | N/R | N/R | 1) Gemini • 2) — • 3) —
Agent & Tool Use | Tau-bench (airline) | agent navigation | 63.5 | N/R | N/R | N/R | N/R | 1) OpenAI o3 • 2) GPT-5 • 3) —
Agent & Tool Use | Tau-bench (retail) | agent navigation | 81.1 | 82.4 | N/R | N/R | N/R | 1) Claude • 2) GPT-5 • 3) —
Specialized | HealthBench (Hard) | pass@1 | 46.2 | N/R | N/R | N/R | N/R | 1) GPT-5 • 2) — • 3) —
Long-Context & Factuality | MRCR v2 | 128k avg / 1M pointwise | N/R | N/R | 58.0 / 16.4 | N/R | N/R | 1) Gemini (reported)
Long-Context & Factuality | SimpleQA | pass@1 | N/R | N/R | 54.0 | N/R | N/R | 1) Gemini
Long-Context & Factuality | FACTS grounding | grounding score | N/R | 77.7 | 87.8 | N/R | N/R | 1) Gemini • 2) Claude
Safety & Robustness | StrongREJECT (jailbreak) | lower is better | N/R | 2.24–6.71% | N/R | N/R | N/R | 1) Claude
Safety & Robustness | BBQ (bias) | bias ↓ / accuracy ↑ | N/R | 0.21 / 99.8 | N/R | N/R | N/R | 1) Claude
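
For readers who want to sanity-check the Top-3 column, here is a small TypeScript sketch of the podium rule stated above: ignore N/R entries, reverse the ordering for "lower is better" benchmarks, and keep the top three. It is a hypothetical illustration for verifying rows by hand, not the code used to build the table (and it does not reproduce the table's tie numbering, such as the ARC-AGI-2 row).

type Score = number | "N/R";

function podium(scores: Record<string, Score>, lowerIsBetter = false): string[] {
  return Object.entries(scores)
    .filter((entry): entry is [string, number] => entry[1] !== "N/R") // drop unreported scores
    .sort(([, a], [, b]) => (lowerIsBetter ? a - b : b - a))          // best score first
    .slice(0, 3)                                                      // podium only
    .map(([model]) => model);
}

// Example: the AIME 2025 row above
podium({ "GPT-5": 94.6, "Claude 4.1": 75.5, "Gemini 2.5 Pro": 88.0,
         "Grok 4": "N/R", "DeepSeek R1": 87.5 });
// → ["GPT-5", "Gemini 2.5 Pro", "DeepSeek R1"]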

Sources


Your comments will be greatly appreciated ... Or just click the "Like" button above the comments section if you enjoyed this blog note.