Summary of "GPT-5.4 Let Mickey Mouse Into a Production Database. Nobody Noticed. (What This Means For Your Work)"

Overview

Main headline: GPT‑5.4 is not uniformly best or worst — it’s the most interesting model tested because it strongly advances agentic/tool-driven capabilities while being inconsistent on writing, filtering, and some factual tasks. A key split exists between “thinking mode” (strong) and “auto mode” (often much weaker).

Evaluations performed and notable outcomes

  1. Business & creative writing

    • Opus 4.6 produced better prose, tone, and product/strategy–style writing.
    • GPT‑5.4 is an improvement over 5.2 but still weaker than Opus for editorial/executive communications.
  2. Verbal creativity (pun extraction and rewriting)

    • Opus 4.6 won due to deeper semantic handling.
    • GPT‑5.4 performed competently.
    • Gemini fabricated sources/URLs in this test.
  3. Schema migration / “eval from hell” (handwritten receipts, many file types, messy business data)

    • GPT‑5.4 excelled at discovery and parsing:
      • Discovered 461/465 files (99.1% coverage).
      • Handled OCR of handwritten receipts and many file types (CSV/Excel/JSON/PDF/VCF/corrupted backups).
    • Produced a migration script of roughly 4,000 lines, a migration report of roughly 11,000 lines, and 30 database tables.
    • Failings:
      • Very poor filtering/deduplication and hygiene (e.g., included a fake customer “Mickey Mouse” and a $25k car wash order).
      • Produced too many business‑status values and output 278 customers versus the correct 176 after deduplication.
    • Runtime: GPT‑5.4 ~56 minutes; Claude completed a similar task in ~15 minutes (but found less data); Gemini ~21 minutes.
  4. Epistemic calibration (factual accuracy & retrieval)

    • In thinking mode GPT‑5.4 performed well on precise facts (e.g., Higgs mass, stock close price).
    • In auto mode it hallucinated (named wrong Nobel winners for a future year), dropping to last place on this axis.
    • Large divergence between thinking mode and auto mode.
  5. Model self‑knowledge

    • GPT‑5.4 scored ~90% accurate on knowledge about its own capabilities (text, coding, media, open‑weight models) — best among the tested models.
  6. Product decision test (two‑sided product problem)

    • Opus 4.6 made the better decision.
    • GPT‑5.4 failed logically on this test; reviewer links better writing to better product reasoning.
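The filtering/deduplication failure in the schema‑migration test (item 3 above) can be illustrated with a minimal sketch. The records, placeholder names, and normalization rule below are hypothetical, not the reviewer's actual pipeline; the point is that a simple hygiene pass would have caught entries like "Mickey Mouse" before they reached the production schema:

```python
# Minimal sketch of the dedupe/filtering step the migration skipped.
# All records and rules here are hypothetical illustrations.
PLACEHOLDER_NAMES = {"mickey mouse", "test customer", "john doe"}

def clean_customers(records):
    """Drop placeholder entries and deduplicate by normalized name."""
    seen = {}
    for rec in records:
        # normalize: lowercase, collapse internal whitespace
        key = " ".join(rec["name"].lower().split())
        if key in PLACEHOLDER_NAMES:
            continue  # fake/test entry: exclude from the migration
        # keep only the first occurrence of each normalized name
        seen.setdefault(key, rec)
    return list(seen.values())

raw = [
    {"name": "Alice Smith"},
    {"name": "alice  smith"},   # duplicate after normalization
    {"name": "Mickey Mouse"},   # placeholder that slipped through
    {"name": "Bob Jones"},
]
print(len(clean_customers(raw)))  # → 2
```

A real pipeline would also validate amounts (e.g., flagging a $25k car wash order as an outlier), but the name-level pass alone shows the kind of filtering the review found missing.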

Three main strengths of GPT‑5.4

  1. Quantitative modeling and analytical rigor

    • Builds deeper statistical models, documents assumptions and limitations (multi‑tab workbooks, ELO‑like systems, Pythagorean expectations).
    • Produces self‑critiques and improvement suggestions.
  2. Broad file/tool processing and tool‑use fluency

    • Handles many file types and common tooling with less friction (progressive tool discovery and runtime tool search).
    • Architectural advance for large tool ecosystems and agents.
  3. Knowledge of the competitive/model landscape

    • Better meta‑knowledge of frontier model capabilities and the AI ecosystem — useful for meta‑learning and coaching.
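The Elo-like systems and Pythagorean expectations mentioned under quantitative modeling follow standard formulas; a minimal sketch (the K factor and sample ratings are illustrative, not taken from the reviewer's workbooks):

```python
# Standard Elo update: expected score from a logistic curve, then
# move each rating toward the actual result. K controls step size.
def elo_update(r_a, r_b, score_a, k=32):
    """Return updated ratings after one game; score_a is 1, 0.5, or 0."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Pythagorean expectation: estimated win fraction from points
# scored vs. allowed (exponent 2 is the classic baseball form).
def pythagorean_expectation(points_for, points_against, exp=2):
    return points_for ** exp / (points_for ** exp + points_against ** exp)

# Equal ratings, A wins: A gains k/2 = 16 points.
print(elo_update(1500, 1500, 1.0))  # → (1516.0, 1484.0)
print(pythagorean_expectation(100, 100))  # → 0.5
```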

