Summary of "GPT-5.2 is dumb (I’m tired of benchmarks)"
Summary of “GPT-5.2 is dumb (I’m tired of benchmarks)”
The video provides an in-depth critique and analysis of the GPT-5.2 model, focusing on its performance, usability, and how it compares to other AI models through various benchmarks and practical tests.
Key Technological Concepts and Product Features
GPT-5.2 Model Issues
- Despite being touted as a very smart model, GPT-5.2 exhibits notable flaws such as factual errors (e.g., miscounting letters), strange financial calculations, and odd behavior.
- The model feels less reliable and more problematic compared to previous versions, resembling issues seen in some Google models.
- The author suspects that excessive optimization for benchmarks (“benchmaxing”) has led to regressions in real-world usability.
Benchmarks and Evaluation
- Traditional benchmarks show GPT-5.2 excelling, but more realistic, user-generated benchmarks reveal regressions.
- The video references Simple Bench by AI Explained, which uses hard, private questions from experts. GPT-5.2 scored lower than expected, underperforming Claude Opus 4, Claude Opus 4.1, Grok 4, and Gemini 2.5 Pro.
- The author’s own Skate Bench (a spatial-reasoning and naming test for skateboard tricks) showed GPT-5.2 performing well with reasoning enabled but poorly with reasoning disabled, suggesting it derives answers through reasoning rather than memorization (see the sketch after this list).
- GPT-5.2 Pro (the most expensive version) performs slightly better but at a significantly higher cost, raising questions about value.
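To make the reasoning-on/reasoning-off comparison concrete, here is a minimal TypeScript sketch of that kind of harness. It is not the author’s actual Skate Bench code: `AskFn` stands in for whatever provider call you use (most vendors expose some reasoning toggle or effort parameter), and the sample question is an illustrative stand-in for the real, author-written set.

```ts
type ReasoningMode = "enabled" | "disabled";
// Hypothetical: wraps whichever provider API you use; assume it sends one
// prompt to the model and returns the text of the reply.
type AskFn = (prompt: string, reasoning: ReasoningMode) => Promise<string>;

// Illustrative stand-in question (the real Skate Bench set is the author's
// own). A kickflip combined with a backside 360 shove-it is a tre flip.
const questions = [
  {
    prompt:
      "The board does a kickflip combined with a backside 360 shove-it. " +
      "What is this trick called? Answer with the trick name only.",
    answer: "tre flip",
  },
];

async function score(ask: AskFn, mode: ReasoningMode): Promise<number> {
  let correct = 0;
  for (const q of questions) {
    const reply = await ask(q.prompt, mode);
    // Naive grading: substring match on the expected trick name.
    if (reply.toLowerCase().includes(q.answer)) correct++;
  }
  return correct / questions.length;
}

// A large gap between score(ask, "enabled") and score(ask, "disabled")
// suggests the model derives answers instead of recalling them, which is
// the pattern the video reports for GPT-5.2.
```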
Writing Arena Project
- A new benchmarking approach in which multiple models write essays, review each other’s work, revise against the feedback, and rank the results in head-to-head comparisons (sketched after this list).
- GPT-5.2 performs well in initial writing and is especially strong at applying feedback and instruction-following.
- Other models like Kimi K2 (available on T3 Chat) are praised for natural tone and usability, and are often preferred for conversational tasks.
- Models like Gemini 3 Pro are criticized for poor writing quality and ignoring feedback, behaving unpredictably.
- Feedback from models like Claude is detailed and constructive, improving essay quality significantly when used to revise GPT-5.2’s drafts.
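The Writing Arena flow described above lends itself to a short sketch. This is not the author’s actual code: the `Model` interface is a hypothetical wrapper around whichever chat APIs are being compared, and ranking here is a simple win count over all head-to-head pairs.

```ts
// Hypothetical wrapper: assume each method sends one prompt to that model
// and returns the generated text.
interface Model {
  name: string;
  write(topic: string): Promise<string>;
  review(essay: string): Promise<string>;
  revise(essay: string, feedback: string): Promise<string>;
  // Judge a head-to-head pair; returns "a" or "b" for the better essay.
  judge(a: string, b: string): Promise<"a" | "b">;
}

async function writingArena(models: Model[], topic: string) {
  // Round 1: every model drafts an essay on the same topic.
  const drafts = await Promise.all(
    models.map(async (m) => ({ author: m, text: await m.write(topic) }))
  );

  // Round 2: each draft is reviewed by every other model, then revised by
  // its author against the combined feedback. (Per the video, this step is
  // where GPT-5.2 shines and Gemini 3 Pro tends to ignore the notes.)
  const revised = await Promise.all(
    drafts.map(async (d) => {
      const reviews = await Promise.all(
        models.filter((m) => m !== d.author).map((m) => m.review(d.text))
      );
      const text = await d.author.revise(d.text, reviews.join("\n\n"));
      return { author: d.author, text };
    })
  );

  // Round 3: head-to-head ranking, every model judging every pair;
  // final order is by total wins.
  const wins = new Map<string, number>();
  for (const e of revised) wins.set(e.author.name, 0);
  for (let i = 0; i < revised.length; i++) {
    for (let j = i + 1; j < revised.length; j++) {
      for (const judge of models) {
        const verdict = await judge.judge(revised[i].text, revised[j].text);
        const winner = verdict === "a" ? revised[i] : revised[j];
        wins.set(winner.author.name, wins.get(winner.author.name)! + 1);
      }
    }
  }
  return [...wins.entries()].sort((x, y) => y[1] - x[1]);
}
```

Having every model judge every pair, rather than relying on one fixed judge, is one way to dilute a single judge’s stylistic bias, which matters when the judges are also contestants.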
Instruction Following and Usability
- GPT models, particularly GPT-5 and GPT-5.2, excel at following instructions precisely, with GPT-5 being more “sterile” (strictly doing what’s asked) and GPT-5.2 occasionally going off-task.
- This ability to follow and apply feedback is seen as GPT’s biggest current strength.
Performance and Speed
- GPT-5.2 is slow in generating content and handling complex tasks compared to other models.
- The video highlights Composer 1 (used in the Cursor IDE) as a model that is less intelligent but much faster and more practical for day-to-day tasks.
- Speed and efficiency are valued more than raw intelligence for practical usage; a simple way to measure the gap is sketched below.
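The speed complaint is measurable. Below is a minimal sketch, assuming a hypothetical `StreamFn` that yields text chunks as they arrive (substitute your provider’s streaming API), which records time-to-first-token and a rough tokens-per-second figure.

```ts
// Hypothetical: streams a model's reply as an async iterable of text chunks.
type StreamFn = (model: string, prompt: string) => AsyncIterable<string>;

async function measureSpeed(stream: StreamFn, model: string, prompt: string) {
  const start = performance.now();
  let firstChunkAt: number | null = null;
  let chars = 0;

  for await (const chunk of stream(model, prompt)) {
    if (firstChunkAt === null) firstChunkAt = performance.now();
    chars += chunk.length;
  }

  const totalMs = performance.now() - start;
  return {
    model,
    timeToFirstTokenMs: firstChunkAt === null ? totalMs : firstChunkAt - start,
    // Rough proxy: about 4 characters per token for English text.
    tokensPerSecond: chars / 4 / (totalMs / 1000),
  };
}

// Running this for GPT-5.2 and Composer 1 on the same prompt makes the
// "less intelligent but much faster" trade-off concrete.
```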
Model Comparison and Ecosystem
- The video contrasts Google’s Gemini line (which has seen less frequent updates and inconsistent quality) with OpenAI and Anthropic models that update more regularly.
- The author notes that smarter models don’t necessarily translate to better user experience or productivity.
- Claude Opus 4.5 is considered a better overall model for practical use despite GPT-5.2’s higher intelligence.
Sponsored Content
- The video includes a sponsor segment for Blacksmith, a service that runs GitHub Actions CI (Continuous Integration) jobs on high-performance gaming hardware, claiming dramatically faster builds (2-4x) at lower cost, with better caching and observability.
Reviews, Guides, and Tutorials Provided
Benchmarks
- Critique of standard benchmarks vs. custom benchmarks (Simple Bench, Skate Bench).
- Introduction of the “Writing Arena” project for more nuanced evaluation of models via essay writing, reviewing, revising, and ranking.
Model Usability Insights
- Detailed analysis of instruction-following capabilities.
- Comparison of model speed and practical utility (Composer 1 vs. GPT-5.2).
Writing Quality and Feedback Analysis
- Examples of essays from GPT-5.2 and Gemini 3 Pro, plus review feedback from Claude.
- Demonstrates how feedback improves essay quality and highlights deficiencies in some models.
Practical Advice
- Recommendation to try Kimi K2 for conversational use.
- Explanation of why faster, more obedient models may be preferable over just “smarter” models.
Main Speakers and Sources
- Primary Speaker/Creator: The YouTuber who runs the channel (unnamed in the subtitles, though the repeated references to T3 Chat and Skate Bench suggest Theo of t3.gg).
- Referenced YouTubers:
- AI Explained – creator of Simple Bench.
- Ben – mentioned as a user dissatisfied with GPT-5.1 and 5.2.
- Models Discussed:
- GPT-5, GPT-5.1, GPT-5.2 (including Pro and High variants)
- Claude Opus 4, Claude Opus 4.1, Claude Opus 4.5
- Gemini 2.5 Pro, Gemini 3 Pro
- Kimi K2 (on T3 Chat)
- Composer 1 (used in the Cursor IDE)
- Grok 4
- Sponsor:
- Blacksmith (CI optimization service)
Overall Conclusion
GPT-5.2, despite its high benchmark scores and intelligence, suffers from practical usability regressions, slower performance, and occasional erratic behavior. The author prefers models that are faster and better at following instructions over raw intelligence. The video advocates for more nuanced benchmarking approaches and highlights the importance of instruction-following and feedback application in AI usability today.