Last update: Monday 8/11/25
When OpenAI released its GPT-5 model last week, CEO Sam Altman boasted about GPT-5's scores on benchmark tests. To place those boasts in a broader context, the editor of this blog constructed a "Leaderboard" comparing GPT-5's benchmark scores with the scores of other leading genAI models.
... Skip to Leaderboard ➡ HERE
Development process
The editor engaged ChatGPT running GPT-5 to generate the first draft, which he submitted to Claude Sonnet 4 for corrections and suggested modifications. When ChatGPT completed its second draft, he asked Claude for another round of corrections and suggestions. The editor's own contributions focused on the consistency and readability of the drafts. The resulting draft was comprehensive, impressive ... and unreadable by humans.
The editor then "vibe coded" a spec asking that the table "pop out" into its own full-screen window so that its data would not be "scrunched up" into multi-line, hyphenated entries. To his delight, ChatGPT came up with some obscure CSS that delivered a scrollable pop-out table that was readable on a smartphone.
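The post does not reproduce that CSS, but for curious readers here is a minimal sketch of the general technique: a fixed, full-screen overlay whose inner container scrolls in both directions so wide rows never wrap. The class names (.table-overlay, .table-scroll) are invented for this illustration and are not the code ChatGPT actually produced.

```html
<!-- Minimal sketch of a full-screen, scrollable table overlay.
     Class names are hypothetical; this is not the CSS ChatGPT generated. -->
<style>
  .table-overlay {
    position: fixed;      /* hover over the blog page */
    inset: 0;             /* fill the viewport */
    background: rgba(0, 0, 0, 0.5);
    display: flex;
    align-items: center;
    justify-content: center;
    z-index: 1000;
  }
  .table-scroll {
    background: #fff;
    max-width: 95vw;
    max-height: 90vh;
    overflow: auto;       /* scroll both axes instead of wrapping cells */
    padding: 1em;
  }
  .table-scroll table {
    border-collapse: collapse;
    white-space: nowrap;  /* keep each cell on one line; no hyphenated wrapping */
  }
  .table-scroll th,
  .table-scroll td {
    border: 1px solid #ccc;
    padding: 0.4em 0.8em;
  }
</style>

<div class="table-overlay">
  <div class="table-scroll">
    <table>
      <!-- leaderboard rows go here -->
    </table>
  </div>
</div>
```

Dismissing the overlay by clicking outside the table, as the live pop-out does, would take a few extra lines of JavaScript that are omitted from this sketch.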
Indeed, the table was so clear that the editor easily spotted a few more glaring inconsistencies and probable errors. One last round of corrections and modifications from Claude yielded the final version of the table.
Our readers are strongly advised to click the link that displays a scrollable copy of the table in a window that hovers over the blog page. We welcome your corrections and suggested modifications.
With a little help from our readers and from our chatbot "assistants", we will endeavor to keep this leaderboard up-to-date... 😎
Click HERE to view a more readable, scrollable, "unscrunched" table in its own window
LLM Benchmarks Leaderboard ... Only verified scores
Last updated: Aug 10, 2025 • Scores are % unless noted; “N/R” = not reported • Podium coloring: green = 1st, yellow = 2nd, pink = 3rd; no color below 3rd.
Category | Benchmark | Protocol / Measure | GPT-5 | Claude 4.1 | Gemini 2.5 Pro | Grok 4 | DeepSeek R1 | Top-3 |
---|---|---|---|---|---|---|---|---|
Core Reasoning & Math | AIME 2025 | no tools | 94.6 | 75.5 | 88.0 | N/R | 87.5 | 1) GPT-5 • 2) Gemini • 3) DeepSeek |
Core Reasoning & Math | GPQA Diamond | pass@1 | 89.4 | 80.9 | 86.4 | 88.9 | 81.0 | 1) GPT-5 • 2) Grok • 3) Gemini |
Core Reasoning & Math | ARC-AGI-2 | abstract reasoning | 9.9 | 8.6 | 15.9 | 15.9 | N/R | 1) Gemini/Grok (tie) • 3) GPT-5 |
Core Reasoning & Math | Humanity’s Last Exam | no tools | 42.0 (Pro) | 11.5 | 21.64 | 44.4 | 14.04 | 1) Grok • 2) GPT-5 • 3) Gemini |
Language Understanding | MMLU | 5-shot | 90.2 | 85.8 | 85.8 | N/R | N/R | 1) GPT-5 • 2) Claude/Gemini (tie) |
Coding | SWE-bench Verified | single attempt | 74.9 | 74.5 | 59.6 | N/R | N/R | 1) GPT-5 • 2) Claude • 3) Gemini |
Coding | SWE-bench Verified | multi attempt | N/R | 79.4 | 67.2 | N/R | 57.6 | 1) Claude • 2) Gemini • 3) DeepSeek |
Coding | Aider Polyglot | pass@1 | 88.0 | 72.0 | 82.2 | N/R | 71.6 | 1) GPT-5 • 2) Gemini • 3) Claude |
Coding | LiveCodeBench | Jan–May 2025 | 72.0 | 51.1 | 69.0 | N/R | 70.5 | 1) GPT-5 • 2) DeepSeek • 3) Gemini |
Multimodal | MMMU | single attempt | 84.2 | 76.5 | 82.0 | N/R | N/R | 1) GPT-5 • 2) Gemini • 3) Claude |
Multimodal | VideoMME | video understanding | N/R | N/R | 84.8 | N/R | N/R | 1) Gemini • 2) — • 3) — |
Agent & Tool Use | Tau-bench (airline) | agent navigation | 63.5 | N/R | N/R | N/R | N/R | 1) OpenAI o3 • 2) GPT-5 • 3) — |
Agent & Tool Use | Tau-bench (retail) | agent navigation | 81.1 | 82.4 | N/R | N/R | N/R | 1) Claude • 2) GPT-5 • 3) — |
Specialized | HealthBench (Hard) | pass@1 | 46.2 | N/R | N/R | N/R | N/R | 1) GPT-5 • 2) — • 3) — |
Long-Context & Factuality | MRCR v2 | 128k avg / 1M pointwise | N/R | N/R | 58.0 / 16.4 | N/R | N/R | 1) Gemini (reported) |
Long-Context & Factuality | SimpleQA | pass@1 | N/R | N/R | 54.0 | N/R | N/R | 1) Gemini |
Long-Context & Factuality | FACTS grounding | grounding score | N/R | 77.7 | 87.8 | N/R | N/R | 1) Gemini • 2) Claude |
Safety & Robustness | StrongREJECT (jailbreak) | lower is better | N/R | 2.24–6.71% | N/R | N/R | N/R | 1) Claude |
Safety & Robustness | BBQ (bias) | bias ↓ / accuracy ↑ | N/R | 0.21 / 99.8 | N/R | N/R | N/R | 1) Claude |
Sources
- Humanity’s Last Exam leaderboard: [Wikipedia](https://en.wikipedia.org/wiki/Humanity%27s_Last_Exam?utm_source=chatgpt.com)
- GPT-5 details & ARC-AGI-2 mention: [Tom's Guide](https://www.tomsguide.com/news/live/openai-chatgpt-5-live-blog?utm_source=chatgpt.com), [For Every Scale](https://www.foreveryscale.com/p/gpt-5-vs-rivals-the-real-score?utm_source=chatgpt.com)
- Grok 4 ARC-AGI-1 results: [THE DECODER](https://the-decoder.com/grok-4-edges-out-gpt-5-in-complex-reasoning-benchmark-arc-agi/?utm_source=chatgpt.com)
- GPT-5 vs Gemini deep prompting takeaways: [Tom's Guide](https://www.tomsguide.com/ai/chatgpt/i-tested-chatgpt-vs-gemini-2-5-pro-with-these-3-prompts-and-it-shows-what-gpt-5-needs-to-do?utm_source=chatgpt.com)
- GSM8K benchmark info: [klu.ai](https://klu.ai/glossary/GSM8K-eval?utm_source=chatgpt.com)
- ARC-AGI-2 benchmark paper: [arxiv.org](https://arxiv.org/abs/2505.11831?utm_source=chatgpt.com)
- System card for Claude Opus 4.1
- System card for GPT-5
Your comments will be greatly appreciated ... Or just click the "Like" button above the comments section if you enjoyed this blog note.