Summary of "A realistic comparison of Opus and Codex"
High-level conclusion
- Overall pick: Codex (5.3 / Codex family) is recommended as the more reliable, capable model for real engineering work. Opus (4.6) is faster, more pleasant, and better at front-end/design but cuts corners and can introduce bugs.
- Best practical approach: use both selectively — Opus for quick scaffolding, UI/design, and personal laptop tasks; Codex for large codebases, migrations, security-sensitive work, and thorough code changes.
“Measure twice, cut once.” Codex tends to be more conservative and correctness-focused; Opus is faster and more creative.
Price, quotas, and inference economics
- Subscription subsidies matter: many users run these models under generous $200/month subscriptions where tokens are heavily subsidized. API costs can differ from the subscription experience.
- Example pricing (may change):
  - Opus: ~$25 / 1M tokens out and ~$5 / 1M tokens in; fast mode is 2–3× faster but ~6× more expensive.
  - Codex (5.2 reference): ~$14 / 1M tokens out and ~$1.75 / 1M tokens in. 5.3 API pricing/behavior is not publicly available yet.
- Token behavior affects cost: Codex often outputs more concisely (fewer tokens); Opus can generate larger outputs and be costlier per run. Final cost depends on token output, run mode, and subscription tier.
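The per-run economics above reduce to simple arithmetic. A minimal sketch using the (subject-to-change) prices quoted above; the token counts are made-up illustrative numbers, not measurements from the video:

```typescript
// Cost of one run given token counts and per-million-token prices.
function runCost(
  inTokens: number,
  outTokens: number,
  inPerM: number,
  outPerM: number
): number {
  return (inTokens / 1e6) * inPerM + (outTokens / 1e6) * outPerM;
}

// Hypothetical run: 200k tokens in, 50k tokens out.
const opusCost = runCost(200_000, 50_000, 5, 25);     // $1.00 in + $1.25 out = $2.25
const codexCost = runCost(200_000, 50_000, 1.75, 14); // $0.35 in + $0.70 out = $1.05
```

Note the asymmetry: even though Codex's listed rates are lower, a run where Opus answers concisely and Codex emits thousands of lines of tests can invert the comparison, which is why final cost depends on token output and run mode.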
Capabilities and “engineering style”
Codex strengths
- Better at solving hard problems, handling blockers, and maintaining correctness.
- Excels in large codebases: finds patterns across a repo, follows conventions, and produces consistent changes.
- Conservative/safety-minded; pushes back on insecure or malicious requests.
- Good for detailed migrations and temporary patch workflows.
- CLI/desktop harness is minimal and reliable; supports interruptible follow-ups and dynamic steering.
Opus strengths
- Much better at generating attractive front-end UI and design.
- Faster to unblock and scaffold — often yields a working prototype quickly.
- Trained on more recent data for some stacks (e.g., newer tooling like Convex and Svelte).
- More permissive on risky tasks (less pushback).
Typical failure modes
- Opus
  - Trims scope or misses wiring pieces.
  - Uses lax typing (lots of any) and may introduce security/logic bugs.
  - Ships faster but often leaves “slop” that needs cleanup.
- Codex
  - Can get stuck in exhaustive “fix everything” loops.
  - May produce enormous irrelevant outputs (e.g., thousands of lines of tests).
  - Slower to scaffold in empty repos without examples.
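The “lots of any” failure mode is worth making concrete. A minimal sketch (hypothetical function names, not from the source): `any` silently disables checking, so a string from a form can flow into arithmetic and the compiler says nothing.

```typescript
// With `price: any`, the compiler cannot catch a string sneaking in.
function addFee(price: any, fee: number) {
  return price + fee; // typed as number, this call site would be checked
}

const fromForm: any = "100"; // unparsed form input

const loose = addFee(fromForm, 20);          // "10020" — string concat, not addition
const strict = addFee(Number(fromForm), 20); // 120 — parse first, then add
```

This is the class of bug that type checks and CI (recommended below) exist to catch, and why Opus output with pervasive `any` needs a cleanup pass.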
Concrete examples & workflow patterns
- T3 Canvas migration: Opus produced a working port quickly but missed front-end wiring; Codex produced more complete output but hit sandbox/network issues when running generation commands.
- Large migration (Round / ping.gg): Codex 5.3 handled an extremely complex migration by iteratively patching and unpatching packages, producing a mergeable PR; Opus could not.
- AISDK v6 migration: Codex’s long-running job created massive test scaffolding and got bogged down; Opus completed a practical working version in minutes for a separate run.
- Security audit: Opus introduced insecure schema shapes but also found some schema/index issues. Using multiple models for reviews is beneficial.
- Local laptop / dotfiles / shell edits: Opus preferred for quick terminal-level tasks and experiments.
- Front-end design workflows: common patterns include Codex implementing logic + Opus refining UI, or Opus mocking UI and Codex implementing logic.
Harnesses, UI, and user experience
- Opus is often used via Claude Code, Anthropic’s coding harness. Reported issues: stashed messages getting lost, image-attachment problems, crashes, compaction problems, and general brittleness. Opus benefits from plan mode and careful, uninterrupted prompts.
- Codex is used via Codex CLI and Codex desktop app. The harness is more minimal but more reliable and better for steering and interrupting runs.
- Important UX differences:
  - Opus benefits from a “plan mode” and can break if interrupted.
  - Codex accepts dynamic steering during runs and often resumes correctly.
  - Subscription tiers affect performance (e.g., lower tiers may lack fast inference).
Safety, moderation, and transparency
- Opus is more permissive; Codex is stricter about unsafe or illegal tasks.
- OpenAI reportedly reroutes some 5.3 queries to an older model (5.2) when potential cybersecurity abuse is detected — this routing may not be transparent in the UI.
- Anthropic tends to ban accounts when policies are violated.
- Models can discover novel cyber attack techniques; platform-level guardrails and monitoring remain necessary.
Prompting, skills, and codebase context
- Opus relies more on training priors; better in greenfield/new-project scenarios and modern patterns. It needs clear planning and explicit instructions (plan mode).
- Codex relies more on repository context and examples; performs best when clear patterns exist in a large codebase.
- Practical tip: provide explicit references — clone/fork reference repos and feed them as context. Use small summarization tooling (e.g., “BTCA” pattern) to give agents concise repo context rather than expecting the model to explore everything itself.
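The video does not detail how the “BTCA” pattern works internally, so the following is only a generic sketch of the idea it describes: compress a repo into a short digest an agent can read in one prompt, instead of letting it explore file by file. All names here are hypothetical.

```typescript
// A repo file as an in-memory record (real tooling would walk the filesystem).
type RepoFile = { path: string; text: string };

// Emit a compact digest: each file's path plus its first few lines,
// small enough to paste into an agent prompt as concise context.
function digest(files: RepoFile[], headLines = 3): string {
  return files
    .filter((f) => !f.path.includes("node_modules")) // skip vendored noise
    .map((f) => {
      const head = f.text.split("\n").slice(0, headLines).join("\n");
      return `### ${f.path}\n${head}`;
    })
    .join("\n\n");
}

const ctx = digest([
  {
    path: "src/db.ts",
    text: "import { sql } from 'drizzle-orm';\n// schema helpers\nexport const db = {};",
  },
  { path: "node_modules/x/index.js", text: "..." },
]);
```

The same shape works for reference repos: clone them, digest them, and feed the digest as context alongside the task.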
Practical recommendations
- If you must pick one: choose Codex for production engineering, migrations, audits, and large codebases.
- If you want speed, front-end visuals, or local tinkering: try Opus, but audit outputs and run type checks/security reviews.
- Best practice: use both. Example flow:
  - Use Opus to quickly scaffold or prototype.
  - Use Codex to harden, audit, and finish.
  - Keep CI, type checks, and human reviews in place to catch errors and security issues.
- Consider subscription tiers and quotas: $200 subs give large usage allowances; lower tiers (e.g., $20) can be limited in speed or features.
Other tools & mentions
- Arcjet (sponsor): Next.js components for bot prevention, email validation, MX-record checks, rate limiting, and middleware shields against attacks such as SQL injection. Integration: install the components, add an API key, and call aj.protect(request, email) to get a decision.
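The decision-based flow described above can be sketched without the real package. This is a self-contained stub that only mirrors the shape (call protect, branch on the returned decision); the actual Arcjet client comes from its Next.js SDK and also handles MX checks, disposable domains, bots, and rate limits, so treat everything below as illustrative:

```typescript
// Stub decision object mirroring the protect() result shape.
type Decision = { isDenied: () => boolean; reason: string };

const aj = {
  // Stub rule: deny obviously malformed emails. The real service applies
  // configured rules (email validation, bot detection, rate limiting, ...).
  async protect(_req: unknown, opts: { email: string }): Promise<Decision> {
    const denied = !/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(opts.email);
    return { isDenied: () => denied, reason: denied ? "INVALID_EMAIL" : "OK" };
  },
};

// A route handler branches on the decision before doing any real work.
async function handleSignup(req: unknown, email: string): Promise<number> {
  const decision = await aj.protect(req, { email });
  if (decision.isDenied()) return 400; // reject before touching the database
  return 200;
}
```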
- Cursor: platform offering early access to long-running model runs (24–72 hour tasks) and model switching between Opus and Codex.
- GitHub Copilot: referenced as a partner/harness example for the Codex family.
- The speaker has additional videos (e.g., front-end model comparisons) and plans more coverage about subscription value.
Speaker’s workflow and meta-notes
- The speaker (Theo / theocodework / T3 developer) uses:
  - Opus exclusively via Claude Code.
  - Codex via the Codex CLI / Codex desktop app.
- Approach: heavy hands-on testing with multiple runs, long inference, and real-app scenarios (T3 Chat, T3 Canvas, Round/ping.gg).
- Prompting: different models require different prompting styles and configuration in agent metadata.
Main speakers / sources
- Primary speaker: Theo (theocodework / T3 developer).
- Models and companies referenced: Opus 4.6, Codex 5.3 (with Codex 5.2 references), Claude/Claude Code (Anthropic), OpenAI, Cursor, GitHub Copilot, Arcjet.