Summary of "This model is kind of a disaster."

High-level summary

The video covers a full day of testing Claude Opus 4.7, Anthropic’s new publicly available model.
The presenter began excited but found the model inconsistent: some genuinely useful improvements, but frequent regressions and harness/tooling problems that made it unreliable for real-world developer tasks.

Better instruction-following
- Follows literal instructions more exactly than earlier Claude models. Prompts tuned for older models may need retuning.
Improved multimodal vision
- Accepts images up to ~2576 px on the long edge (~4MP) — better for dense screenshots, diagrams, and pixel-detail tasks.
Stronger advanced software engineering
- Anthropic claims better handling of complex, long-running coding tasks, improved rigor, self‑verification of outputs, and gains on many benchmarks (notably agentic coding on some suites).
Domain improvements
- Reported gains in finance/legal analyses, “memory” via file-system persistence across sessions, and slightly reduced misaligned outputs (still not Mythos-level).
New Claude Code controls
- An “extra high” effort level (between high and max), an ultra-review command for automated code reviews, and auto mode (permission prompts routed to a classifier). Higher effort levels trade tokens for performance; max uses many tokens.
Availability & pricing
- Publicly available through Anthropic’s Claude products and API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
- Pricing is reported as unchanged from Opus 4.6: $5 per million input tokens and $25 per million output tokens.
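
To make the reported prices concrete, here is a minimal sketch of a per-run cost estimate. It assumes plain list pricing with no caching or batch discounts, and the token counts in the example are hypothetical, not figures from the video.

```python
# Prices as reported in the summary above (USD per million tokens).
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one run at the reported list prices."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    return input_cost + output_cost


# Hypothetical long agentic run: 2M input tokens, 400k output tokens.
print(f"${estimate_cost_usd(2_000_000, 400_000):.2f}")  # prints "$20.00"
```

Output tokens cost five times as much as input tokens, so runs that generate a lot of text or reasoning dominate the bill.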

Opus 4.7 is not as capable as the Claude Mythos preview: Anthropic has limited Mythos and intentionally reduced cyber capabilities in Opus 4.7.
Explicit safety measures
- Automatic detection/blocks for high-risk cybersecurity requests.
- Anthropic launched a cyber verification program for legitimate security researchers to ask higher-risk questions.
Benchmarks
- Better on many benchmarks but worse on some (e.g., Agentic Search).
- Benchmark contamination was noted — scores are not definitive.

System prompt leakage
- Safety/system reminders were injected into user conversations in Claude Code Desktop and flagged legitimate requests as prompt injection or malware.
Safety filters over-blocking
- Benign tasks (e.g., decoding a DEF CON “Gold Bug” puzzle) were paused and required fallback to much weaker models.
Permission system flaky
- “Auto mode” and “bypass permissions” sometimes failed, causing repeated manual approvals and interrupting long tasks.
File-edit / harness expectations
- Claude Code expects the model to read a file before modifying it; Opus 4.7 repeatedly failed to follow that harness rule and attempted improper edits.
Token usage
- “Max” settings burn huge numbers of tokens — significant cost implications for long runs.
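
The read-before-edit rule noted above can be pictured as a small guard on the harness side. The sketch below is purely illustrative: the class and method names are hypothetical, and it is not a description of how any real harness implements the check.

```python
from pathlib import Path


class ReadBeforeEditGuard:
    """Hypothetical harness-side check: a file must be read before it is edited."""

    def __init__(self) -> None:
        self._read_paths: set[Path] = set()

    def record_read(self, path: str) -> None:
        # Normalise so "./a.py" and "a.py" count as the same file.
        self._read_paths.add(Path(path).resolve())

    def check_edit(self, path: str) -> None:
        # Reject the edit if this file was never read during the session.
        if Path(path).resolve() not in self._read_paths:
            raise PermissionError(f"edit rejected: {path} was not read first")


guard = ReadBeforeEditGuard()
guard.record_read("src/app.py")
guard.check_edit("./src/app.py")      # allowed: same file, already read
try:
    guard.check_edit("src/other.py")  # never read, so the guard refuses
except PermissionError as err:
    print(err)
```

A model that skips the read step hits the rejection every time, which matches the repeated failed edits the presenter describes.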

Skipped web lookups and reconnaissance when they were required
- Recommended outdated Next.js 15 instead of Next.js 16 (broke builds) because it didn’t check latest package versions.
Poor migration/versioning choices
- Suggested upgrades (e.g., Tailwind 3→4) without acknowledging breaking changes.
Faulty generated scripts
- Cloning/zsh scripts copied unstaged/ignored files (rsync/git-ignore confusion) and didn’t switch branches as requested.
Inconsistent behavior across runs
- One run could be excellent; the next run with the same prompt could fail badly.
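
One way to guard against the stale-version failure above is to compare a planned version against the registry before pinning it. This is a minimal sketch, assuming the public npm registry endpoint `https://registry.npmjs.org/<package>/latest`, which returns JSON with a `version` field, and unscoped package names. The helper names are hypothetical.

```python
import json
import urllib.request


def major_version(version: str) -> int:
    """Extract the major component from a semver string such as '16.0.1'."""
    return int(version.lstrip("v").split(".")[0])


def is_major_behind(planned: str, latest: str) -> bool:
    """True when the planned version is at least one major release behind."""
    return major_version(planned) < major_version(latest)


def fetch_latest_version(package: str) -> str:
    """Ask the npm registry for the version tagged 'latest' (needs network access)."""
    url = f"https://registry.npmjs.org/{package}/latest"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)["version"]


# Offline illustration using the major versions mentioned above.
print(is_major_behind("15.0.0", "16.0.0"))  # prints "True"
```

In a real workflow, `fetch_latest_version("next")` would supply the second argument, and a `True` result would be the cue to read the migration notes before upgrading.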

In the presenter’s experience, GPT-5.4 handled the same tasks more consistently:
- Performed web fetches, found correct package versions, and was steadier for front-end modernization tasks.
Harness routing differences matter
- Cursor and T3 Code harnesses route models differently; users may get different results depending on which harness they hit (Cursor sometimes routes to different internal harnesses).

Suspected cause of regressions
- The presenter attributes the regressions mainly to unstable engineering around Claude Code and other harnesses rather than to intrinsic model deterioration.
- Anthropic apparently uses internal tooling that differs from what public users get; the public Claude Code rollout appears buggy and is degrading the real-world experience.
Critique of Anthropic processes
- The presenter calls out Anthropic’s engineering/QA as a core problem: frequent sloppiness in the product makes the model seem worse than it is.
Overall verdict (presenter)

Opus 4.7 has real improvements and may be worth trying (especially if it’s already in a product you use), but it is inconsistent and currently unreliable for production-level developer workflows.

If you depend on agents for long-running coding/refactor jobs, test Opus 4.7 on non-critical projects first and recheck prompts and harness behavior.
Retune prompts written for older Claude models (Opus 4.7 follows instructions more literally).
Watch for Claude Code Desktop/CLI updates and harness-specific bugs (system prompt leakage, permission failures).
Consider testing the same workflow against other models/harnesses (e.g., OpenAI / GPT‑5.4 in the presenter’s tests) to compare reliability and web lookup behavior.
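
Because the same prompt can succeed on one run and fail on the next, a single trial says little. A minimal sketch of a repeat-run check follows; `run_task` is a hypothetical callable standing in for whatever launches the agent and reports success, and the outcomes in the example are made up for illustration.

```python
from typing import Callable


def pass_rate(run_task: Callable[[], bool], runs: int = 5) -> float:
    """Run the same task several times and return the fraction that succeeded."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    successes = sum(1 for _ in range(runs) if run_task())
    return successes / runs


# Stand-in for a real agent run: a fixed pattern of outcomes (3 passes out of 5).
outcomes = iter([True, False, True, True, False])
print(pass_rate(lambda: next(outcomes), runs=5))  # prints "0.6"
```

Running the same check against each model and harness combination gives a like-for-like reliability comparison rather than an impression from one good or bad run.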

Modernizing an old project
- Produced a good plan but executed using outdated package versions, leading to broken builds.
Cloning script for zsh
- Generated script copied unstaged/ignored files and failed to switch branch to main.
Gold Bug puzzle (DEF CON)
- Model flagged the task as dangerous and paused interaction.
Claude Code Desktop
- Showed a system reminder that labeled the user’s own site as prompt injection/malware, forcing a manual override.

Video presenter: Theo (creator/host running tests).
Primary external sources and systems referenced: Anthropic (Claude Opus 4.7 release notes), Claude Mythos / Project Glasswing, Claude Code (Desktop & CLI), Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, Cursor, T3 Code.
Comparative reference: OpenAI (GPT‑5.4 / GPT family).
On-video interactions / commenters: Anthropic staff and community members referenced (Ricky from the React team, Poric, Ryan, Gurgley, Tibo from OpenAI).