Summary of "This model is kind of a disaster."
High-level summary
- The video tests Anthropic’s newly released public model, Claude Opus 4.7, over a full day.
- The presenter began excited but found the model inconsistent: some genuinely useful improvements, but frequent regressions and harness/tooling problems that made it unreliable for real-world developer tasks.
What Opus 4.7 claims / notable new features
- Better instruction-following
- Follows literal instructions more exactly than earlier Claude models. Prompts tuned for older models may need retuning.
- Improved multimodal vision
- Accepts images up to ~2576 px on the long edge (~4MP) — better for dense screenshots, diagrams, and pixel-detail tasks.
- Stronger advanced software engineering
- Anthropic claims better handling of complex, long-running coding tasks, improved rigor, self‑verification of outputs, and gains on many benchmarks (notably agentic coding on some suites).
- Domain improvements
- Reported gains in finance/legal analyses, “memory” via file-system persistence across sessions, and slightly reduced misaligned outputs (still not Mythos-level).
- New Claude Code controls
- An “extra high” effort level (between high and max), an ultra-review command for automated code reviews, an auto mode (permission prompts routed to a classifier), and a token/performance trade-off across effort levels (max uses many tokens).
- Availability & pricing
- Publicly available across Anthropic’s Claude products and API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
- Pricing reported the same as Opus 4.6: $5 / million input tokens, $25 / million output tokens.
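To make the reported pricing concrete, the cost of a run can be estimated from its token counts. This is a minimal sketch using only the two prices listed above; the function name and the sample token counts are illustrative assumptions, not figures from the video.

```python
# Prices as reported in this summary (USD per million tokens).
INPUT_PRICE_PER_MTOK = 5.0
OUTPUT_PRICE_PER_MTOK = 25.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one run."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    )


if __name__ == "__main__":
    # Hypothetical long agentic run: 2M input tokens, 400k output tokens.
    print(f"${estimate_cost_usd(2_000_000, 400_000):.2f}")  # prints $20.00
```

Because output tokens cost five times as much as input tokens at these prices, effort settings that produce long outputs are likely to dominate the bill.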
Limitations, safety & benchmark notes
- Not as capable as the Claude Mythos preview (Anthropic restricted access to Mythos and intentionally reduced cyber capabilities in Opus 4.7).
- Explicit safety measures
- Automatic detection/blocks for high-risk cybersecurity requests.
- Anthropic launched a cyber verification program for legitimate security researchers to ask higher-risk questions.
- Benchmarks
- Better on many benchmarks but worse on some (e.g., Agentic Search).
- Benchmark contamination was noted — scores are not definitive.
Hands-on findings
Harness / Claude Code problems (dominated the experience)
- System prompt leakage
- Safety/system reminders were injected into user conversations in Claude Code Desktop and flagged legitimate requests as prompt injection or malware.
- Safety filters over-blocking
- Benign tasks (e.g., decoding a DEF CON “Gold Bug” puzzle) were paused and required fallback to much weaker models.
- Permission system flaky
- “Auto mode” and “bypass permissions” sometimes failed, causing repeated manual approvals and interrupting long tasks.
- File-edit / harness expectations
- Claude Code expects a model to read a file before modifying it; Opus 4.7 repeatedly failed to follow that harness rule and attempted improper edits.
- Token usage
- “Max” settings burn huge numbers of tokens — significant cost implications for long runs.
Model behavior issues
- Failed web lookups / recon when required
- Recommended outdated Next.js 15 instead of Next.js 16 (broke builds) because it didn’t check latest package versions.
- Poor migration/versioning choices
- Suggested upgrades (e.g., Tailwind 3→4) without acknowledging breaking changes.
- Faulty generated scripts
- Cloning/zsh scripts copied unstaged/ignored files (rsync/git-ignore confusion) and didn’t switch branches as requested.
- Inconsistent behavior across runs
- One run could be excellent; the next run with the same prompt could fail badly.
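The outdated-version problem above is the kind of thing a quick registry lookup can catch before a build breaks. The sketch below is one possible check, not what the presenter or any harness actually ran: it assumes the public npm registry's per-package `latest` document, and the helper names are hypothetical. It does not handle scoped package names or pre-release tags.

```python
import json
import urllib.request


def fetch_latest_version(package: str) -> str:
    """Look up the latest published version of an unscoped npm package.

    Reads the public npm registry's "latest" document for the package.
    Requires network access.
    """
    url = f"https://registry.npmjs.org/{package}/latest"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)["version"]


def major(version: str) -> int:
    """Extract the major number from a version such as '16.0.1' or '^15.2.0'."""
    return int(version.lstrip("^~>=<v ").split(".")[0])


def is_major_behind(proposed: str, latest: str) -> bool:
    """Return True when the proposed version is at least one major behind."""
    return major(proposed) < major(latest)
```

For example, `is_major_behind("15.1.0", fetch_latest_version("next"))` would flag a suggestion that trails the current major release. A major-version gap also signals that a migration guide should be read before upgrading.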
Comparative testing
- In the presenter’s experience, GPT-5.4 handled the same tasks more consistently:
- Performed web fetches, found correct package versions, and was steadier for front-end modernization tasks.
- Harness routing differences matter
- Cursor and T3 Code harnesses route models differently; users may get different results depending on which harness they hit (Cursor sometimes routes to different internal harnesses).
Presenter’s interpretation / analysis
- Suspected cause of regressions
- Regressions mainly stem from bad/unstable engineering around Claude Code and other harnesses rather than intrinsic model deterioration.
- Anthropic apparently uses internal tooling different from what public users get; the public Claude Code rollout appears buggy and is degrading real-world experience.
- Critique of Anthropic processes
- Calls out Anthropic’s engineering/QA as a core problem — frequent sloppiness in the product makes the model seem worse.
- Overall verdict (presenter)
- Opus 4.7 has real improvements and may be worth trying (especially if it’s already in a product you use), but it is inconsistent and currently unreliable for production-level developer workflows.
Practical guidance / takeaways
- If you depend on agents for long-running coding/refactor jobs, test Opus 4.7 on non-critical projects first and recheck prompts and harness behavior.
- Retune prompts written for older Claude models (Opus 4.7 follows instructions more literally).
- Watch for Claude Code Desktop/CLI updates and harness-specific bugs (system prompt leakage, permission failures).
- Consider testing the same workflow against other models/harnesses (e.g., OpenAI / GPT‑5.4 in the presenter’s tests) to compare reliability and web lookup behavior.
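Since the main complaint is run-to-run inconsistency, a single successful run says little. One way to compare setups is to repeat the same task several times per model/harness and look at the pass rate. The sketch below assumes each setup can be wrapped in a hypothetical zero-argument callable that returns `True` on success (for example, "the build passed"); how that callable invokes a model is left out on purpose.

```python
from typing import Callable, Dict


def pass_rates(
    runners: Dict[str, Callable[[], bool]], trials: int = 5
) -> Dict[str, float]:
    """Run each task `trials` times and return the fraction of passing runs."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    rates = {}
    for name, run_once in runners.items():
        passes = sum(1 for _ in range(trials) if run_once())
        rates[name] = passes / trials
    return rates


if __name__ == "__main__":
    from itertools import cycle

    # Stand-in runners: one always passes, one alternates pass/fail.
    flaky = cycle([True, False])
    print(pass_rates({"steady": lambda: True, "flaky": lambda: next(flaky)}, trials=4))
    # prints {'steady': 1.0, 'flaky': 0.5}
```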
Examples / demos referenced
- Modernizing an old project
- Produced a good plan but executed using outdated package versions, leading to broken builds.
- Cloning script for zsh
- Generated script copied unstaged/ignored files and failed to switch branch to main.
- Gold Bug puzzle (DEF CON)
- Model flagged the task as dangerous and paused interaction.
- Claude Code Desktop
- Showed a system reminder that labeled the user’s own site as prompt injection or malware, forcing a manual override.
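For the cloning demo, one way to avoid copying untracked or ignored files is to let `git clone` do the copy instead of a file-level tool such as rsync, since a clone carries only committed history. The sketch below assumes the goal was a clean copy of the committed state on `main`; the exact prompt from the video is not given here, and the function names are hypothetical. Uncommitted changes are also left behind, which may or may not be wanted.

```python
import subprocess
from typing import List


def build_clone_command(
    source: str, destination: str, branch: str = "main"
) -> List[str]:
    """Build a git command that clones `source` and checks out `branch`."""
    return [
        "git", "clone",
        "--branch", branch,   # check out this branch after cloning
        "--single-branch",    # fetch only that branch's history
        source, destination,
    ]


def clone_clean_copy(source: str, destination: str, branch: str = "main") -> None:
    """Run the clone; raises CalledProcessError if git exits with an error."""
    subprocess.run(build_clone_command(source, destination, branch), check=True)
```

Keeping the command construction separate from execution makes it easy to print and review what will run before an agent, or a person, executes it.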
Main speakers / sources cited
- Video presenter: Theo (creator/host running tests).
- Primary external sources and systems referenced: Anthropic (Claude Opus 4.7 release notes), Claude Mythos / Project Glasswing, Claude Code (Desktop & CLI), Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, Cursor, T3 Code.
- Comparative reference: OpenAI (GPT‑5.4 / GPT family).
- On-video interactions / commenters: Anthropic staff and community members referenced (Ricky from React team, Poric, Ryan, Gurgley, Tibo from OpenAI).
Category
Technology