-- Maxwell Zeff, TechCrunch, 8/7/25
-- GPT-5 for Apple Intelligence in The Verge, Engadget, MacRumors
-- Doubts and criticisms in Gizmodo, VentureBeat
This is a combined summary of three articles from TechCrunch, The Information, and Gizmodo.
GPT-5 is OpenAI’s first “unified” model, merging the reasoning ability of its o-series with the speed of its GPT-series. It aims to act more like an AI agent than a chatbot, performing tasks such as generating software, managing calendars, and creating research briefs.
- Real-time routing chooses between faster responses or deeper reasoning automatically.
- Replaces multiple user settings with adaptive, behind-the-scenes configuration.
OpenAI says GPT-5 slightly outperforms competitors like Claude Opus 4.1 and Gemini 2.5 Pro in coding, creative writing, and certain science benchmarks, though it lags in some tests. It claims reduced hallucinations and improved safety.
- SWE-bench Verified coding score: 74.9% (OpenAI's own figure, not independently verified).
- GPQA Diamond science test: claimed 89.4% first-try accuracy.
GPT-5 is designed to hallucinate less on medical queries and be more proactive in flagging health concerns, while reducing unsafe answers without over-blocking harmless ones.
- HealthBench hallucination rate reportedly 1.6% (unverified).
- Deception rate claimed lower than previous models.
ChatGPT now offers selectable personalities (Cynic, Robot, Listener, Nerd), and GPT-5 is available in multiple API sizes with adjustable verbosity. Higher-tier subscribers get more usage and a “Pro” version for deeper reasoning.
- API pricing starts at $1.25 per million input tokens.
- gpt-oss, a free open-weight reasoning model, was also released.
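The stated input pricing and adjustable verbosity can be sketched with simple arithmetic. This is a minimal sketch: the request-payload field names below are placeholders for illustration, not the documented API, and $1.25 per million input tokens is the only price stated above.

```python
def estimate_input_cost(n_input_tokens: int, price_per_million: float = 1.25) -> float:
    """Input-token cost in dollars at the article's stated $1.25 per million input tokens."""
    return n_input_tokens / 1_000_000 * price_per_million

def build_request(prompt: str, verbosity: str = "low") -> dict:
    """Hypothetical request payload with an adjustable-verbosity field.

    Field names here are assumptions; consult the actual API reference.
    """
    return {"model": "gpt-5", "input": prompt, "text": {"verbosity": verbosity}}

# A 2M-token prompt batch would cost $2.50 in input tokens at the stated rate.
print(estimate_input_cost(2_000_000))  # 2.5
```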
OpenAI positions GPT-5 as a bellwether for AI’s progress, aiming to maintain leadership in both consumer adoption and enterprise integration. Its success or failure could influence Big Tech strategies and policy debates.
- Free users now default to GPT-5, expanding reach.
- Weekly ChatGPT usage reportedly at 700M people.
Internal development faced setbacks, including Orion’s failure to outperform GPT-4o and degraded results when reasoning models were converted to chat formats. This reflects a slowdown in AI capability jumps across the industry.
- Declining returns from pre-training due to limited high-quality data.
- Some techniques worked on small models but failed to scale.
OpenAI’s “universal verifier” uses one model to check another’s outputs, improving both verifiable domains (like coding) and subjective tasks. RL has become a core driver of GPT-5’s capabilities.
- Builds on Q* reasoning advances from late 2023.
- Also improves AI agent ability to handle complex, multi-rule tasks.
OpenAI is pushing automated coding to compete with Anthropic, while managing tensions with Microsoft over IP rights and equity stakes. Talent losses to Meta have further strained R&D continuity.
- Microsoft tests reportedly show GPT-5 quality gains without major compute cost increases.
- Anthropic’s lead in developer tools spurred renewed OpenAI coding focus.
Even incremental upgrades in GPT-5 could drive revenue growth and justify OpenAI’s planned $45B in infrastructure spending over 3.5 years. Internal optimism extends to possibly reaching “GPT-8” with current methods.
- Executives see coding automation as key to AI research efficiency.
- Microsoft is likely to hold ~33% equity after the restructuring.
Despite OpenAI’s claims of reduced “effusive agreeableness,” GPT-5 still produces confident but wrong answers to simple factual questions, and can be manipulated by suggestive prompts.
- Example: incorrect lists of U.S. states whose names contain the letter "R"; it sometimes revised a correct list after a user bluffed that the list was wrong.
- Demonstrates that polished rhetoric and benchmark gains don’t eliminate core generative AI limitations.
-- Maxwell Zeff, TechCrunch, 8/5/25
OpenAI releases a free GPT model that can run on your laptop
OpenAI has launched GPT-OSS, its first open-weight models since GPT-2 in 2019, marking a strategic shift from closed-only releases. The two variants—gpt-oss-120b and gpt-oss-20b—are available under the Apache 2.0 license, allowing commercial use, redistribution, and modification. They can run locally, be fine-tuned, and operate without internet access.
- gpt-oss-120b performs similarly to OpenAI’s o4-mini; gpt-oss-20b is comparable to o3-mini and runs on devices with 16GB of VRAM.
- Available free via Hugging Face, Databricks, Azure, and AWS.
Although text-only, GPT-OSS supports chain-of-thought reasoning, web browsing, code execution, and agent operation via APIs. OpenAI says the models are complementary to its paid offerings, aiming to give developers more control over data and customization.
- Can integrate with closed models for hybrid workflows.
- Intended for developers, smaller companies, and organizations seeking privacy and flexibility.
OpenAI claims GPT-OSS is its most rigorously tested model, involving external safety firms and internal “red-teaming” to explore misuse scenarios. Tests focused on risks like cybersecurity and bioweapons; the model did not reach high-risk levels under OpenAI’s preparedness framework.
- Chain-of-thought output is exposed to monitor potential misuse.
- Release was delayed earlier this year for additional safety review.
The release follows competitive pressure from open-weight leaders like Meta (Llama series) and Chinese startup DeepSeek. OpenAI’s move is positioned as keeping open innovation “based on democratic values” in the US. The models could challenge Meta’s developer appeal and influence the ongoing AI talent race.
- Meta has hinted at pulling back from open releases over safety concerns.
- OpenAI frames GPT-OSS as boosting global innovation while reinforcing its domestic leadership.
-- Cogni Down Under, Medium, 8/6/25
This is a combined summary of two articles, from Medium and VentureBeat.
Claude Opus 4.1 delivers measurable improvements across coding and reasoning benchmarks, scoring 74.5% on SWE-bench Verified (up from 72.5% in Opus 4). This outperforms OpenAI’s o3 (69.1%) and Google’s Gemini 2.5 Pro (67.2%), reinforcing Anthropic’s lead in AI coding tools. The model also posted gains in Terminal-Bench, GPQA Diamond, and AIME 2025, signaling steady, engineering-driven progress.
- Incremental but broad-based performance gains across multiple benchmarks.
- Retains a 200K-token context window with extended reasoning capacity up to 64K tokens.
Anthropic’s release emphasizes multi-file code refactoring, with GitHub and Rakuten praising its ability to make precise changes in large codebases without introducing bugs. Opus 4.1 also shows improved autonomous agent performance, enabling longer unsupervised tasks such as extended research or multi-step development projects.
- Outperforms in large, complex code maintenance rather than just writing small functions.
- Effective for both software development and independent research workflows.
Safety Standards and Pricing
Operating under AI Safety Level 3, Opus 4.1 achieves a 98.76% refusal rate for harmful requests while keeping benign refusals extremely low (0.08%). Pricing matches Opus 4 at $15/$75 per million tokens (input/output), with discounts for batch processing and prompt caching. For heavy coding use, costs can rival a junior developer’s salary, but may be justified for productivity gains.
- Maintains strong safety controls without excessive over-blocking.
- Pricing structure unchanged, with potential for cost optimization via caching.
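To make the "junior developer's salary" comparison concrete, here is a minimal cost sketch using the stated $15/$75 per million input/output tokens. The 50% batch-discount factor and the workload figures are assumed for illustration, not quoted rates.

```python
def opus_cost_usd(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Estimate Claude Opus 4.1 API cost from the article's stated prices."""
    IN_PER_M, OUT_PER_M = 15.0, 75.0  # dollars per million tokens (stated in the article)
    cost = input_tokens / 1_000_000 * IN_PER_M + output_tokens / 1_000_000 * OUT_PER_M
    if batch:
        cost *= 0.5  # assumed illustrative batch discount, not a quoted rate
    return cost

# A hypothetical heavy coding workload of 20M input + 2M output tokens per day:
daily = opus_cost_usd(20_000_000, 2_000_000)  # 300 + 150 = 450.0 dollars
monthly = daily * 22                          # ~$9,900 over 22 workdays
```

At that assumed volume, monthly spend lands in the same range as a junior developer's pay, which is the trade-off the summary describes.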
Anthropic’s revenue has surged from $1B to $5B in seven months, but nearly half of its $3.1B API income comes from just two clients — Cursor and GitHub Copilot — generating $1.4B combined. This concentration creates vulnerability if either contract changes.
- Rapid revenue growth highlights market demand.
- Heavy reliance on a small number of large customers poses strategic risk.