Summary of "GPT-5.5 is a total freak"
Summary of GPT-5.5 Video (tech demos, features, benchmarks)
The video reviews OpenAI’s GPT-5.5 as a major upgrade, emphasizing that it is more “performant” and better at agentic/automation workflows, especially in coding environments rather than just chat. The creator claims it makes fewer mistakes and runs more smoothly than the prior-generation model, while also noting that it can still struggle with high-stakes accuracy (medical imaging) and can hallucinate significantly on certain benchmarks.
Product / feature highlights
Key focus: agentic workflows
- GPT-5.5 is presented as “optimized” for agentic use in Codeex.
- Multiple agents can work across a folder-based project with iterative development, positioned as better than using ChatGPT’s single-chat interface.
Usage locations
- Available in the ChatGPT model selector (for paid tiers).
- Available in Codeex via the model dropdown.
Thinking effort / model variants
- The demo uses “thinking” modes (e.g., extra high and extended) to improve output quality/performance.
Demo 1: Interactive Earth “digital twin” (web 3D)
Goal: Generate an interactive 3D globe where users can zoom from space to city streets, with efficient browser loading.
Prompt requirements
- Realistic Earth rendering
- Publicly available assets/models/layers if needed
- Efficient load for a regular web browser
- City mapping on a 3D globe
- Layer toggles (e.g., nightlights, streets vs 3D buildings)
- Street view-like diving
Iteration / control shown
- Initial output: 3D buildings were “lackluster”
- Follow-up prompts:
- Improve building rendering quality
- Load buildings more efficiently
Result capabilities claimed
- Zoom/pan to specific locations (e.g., New York, San Francisco)
- Toggle night lights
- Render 3D buildings; optionally disable streets
- Provide a street view map view
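The video does not show the generated code, but the core arithmetic behind “zoom/pan to a specific location” on a 3D globe is a latitude/longitude-to-Cartesian mapping. A minimal sketch in plain JavaScript (function name and axis conventions are assumptions, not from the video):

```javascript
// Map latitude/longitude (degrees) to a 3D point on a sphere of radius r,
// the standard conversion a three.js-style globe uses to fly the camera
// to a city such as New York or San Francisco. Sketch only.
function latLonToVec3(latDeg, lonDeg, r = 1) {
  const lat = (latDeg * Math.PI) / 180;
  const lon = (lonDeg * Math.PI) / 180;
  return {
    x: r * Math.cos(lat) * Math.cos(lon),
    y: r * Math.sin(lat),
    z: r * Math.cos(lat) * Math.sin(lon),
  };
}

// New York is roughly 40.71 N, 74.01 W:
const ny = latLonToVec3(40.71, -74.01);
```

City markers and camera targets both reduce to this one conversion; toggleable layers (night lights, streets) are then textures swapped on the same sphere.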
Demo 2: Ray tracing simulator (adjustable materials)
Goal: Build a ray tracing simulation in standalone HTML using prompts.
Scene elements
- One sphere, cube, pyramid
- Blue sky + checkered ground
Material / physics parameters
- Sliders for position, reflectivity, roughness, translucency, color, and specular-style controls
Web performance instruction
- Prompt includes a keyphrase intended to make it run smoothly in-browser.
Iteration / control shown
- Initial prompt: sliders added for the sphere only
- Additional prompts: sliders added for other shapes
- Fix: consistent handling of reflections through translucent objects (reflected vs. seen-through appearance)
Result behavior claimed
- Adjusting reflectivity changes correct reflections between objects
- Translucency changes appearance from metallic reflective to near-transparent
- Cube/pyramid respect their parameter sliders and interactions
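The heart of any such simulator is a ray-primitive intersection test. As an illustrative sketch only (the demo’s actual code is not shown), here is the textbook ray-sphere intersection in plain JavaScript:

```javascript
// Ray-sphere intersection: returns the smallest positive t where
// origin + t * dir hits the sphere, or null on a miss. Standard
// quadratic-discriminant form; not the demo's actual code.
function intersectSphere(origin, dir, center, radius) {
  const oc = {
    x: origin.x - center.x,
    y: origin.y - center.y,
    z: origin.z - center.z,
  };
  const a = dir.x * dir.x + dir.y * dir.y + dir.z * dir.z;
  const b = 2 * (oc.x * dir.x + oc.y * dir.y + oc.z * dir.z);
  const c = oc.x * oc.x + oc.y * oc.y + oc.z * oc.z - radius * radius;
  const disc = b * b - 4 * a * c;
  if (disc < 0) return null; // ray misses the sphere
  const t = (-b - Math.sqrt(disc)) / (2 * a); // nearer of the two roots
  return t > 0 ? t : null;
}
```

The material sliders (reflectivity, roughness, translucency) decide what happens *after* a hit: whether the ray bounces, scatters, or continues through the surface.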
Demo 3: Medical image analysis (CTs)
The video tests GPT-5.5 image understanding on cancer identification.
A) Chest CT lesions (4 slices)
- Request: describe the photo and circle lesions
- Outcome: 3 of 4 slices correct
- Not perfect: one lesion was circled incorrectly, and some of the correct slices were visually obvious cases.
B) Brain tumor identification (6 images)
- Request: identify tumor types in each of six images
- Outcome: not fully correct
- Multiple misclassifications (e.g., “no tumor” where a tumor type was expected; wrong tumor labels in several positions).
- Conclusion: despite being “state-of-the-art,” it cannot reliably identify brain tumors from CT scans in this test.
Demo 4: Codeex “liquid splashes” lab with hand tracking
Goal: Create an interactive liquid splash simulator with adjustable physical/render parameters, controlled via webcam hand tracking.
Initial issues
- Lighting artifacts (“light flash every few seconds”)
- Cursor-following fake particles disliked
- Efficiency concerns for a web browser
Fixes / iterations
- Adjust lighting behavior (angle/intensity), background, performance
- Remove fake cursor particles
- Add more sliders controlling:
- splash size/force
- turbulence/flow speed
- color speed / saturation
- persistence (how long splashes remain)
- gravity direction & magnitude
- light angle and light power
- Add “enable hand” mode for webcam control
Result claimed
- “Fully functional” interface
- “Physically accurate” look
- Hand-controlled interaction smoother than other tested state-of-the-art models
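For context on what sliders like “gravity direction & magnitude” and “persistence” typically control, a per-particle update step might look like the following. This is a hypothetical sketch; the field and parameter names are assumptions, not taken from the demo:

```javascript
// One explicit-Euler update for a splash particle, driven by the kinds
// of slider parameters listed above. Hypothetical sketch only.
function stepParticle(p, params, dt) {
  const [gx, gy] = params.gravityDirection; // unit vector from a slider
  const g = params.gravityMagnitude;
  return {
    x: p.x + p.vx * dt,
    y: p.y + p.vy * dt,
    vx: p.vx + gx * g * dt,
    vy: p.vy + gy * g * dt,
    // "persistence" = seconds a splash remains; life runs from 1 to 0
    life: p.life - dt / params.persistence,
  };
}
```

Each slider then simply rewrites one field of `params`, which is why adding “more sliders” is cheap once the simulation loop exists.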
Demo 5: 3D scene generation from a complex office image
Goal: Convert a messy isometric office image into a detailed 3D animated scene via a single HTML file.
Iteration
- Early results were “lackluster”
- Prompts requested more detail and coherence
- Specific consistency fixes (e.g., ceiling lights attached to ropes, screens on monitors)
Result claimed
- Generates tables/chairs/monitors, plus plants/books/humans
- Includes animations (e.g., monitor screens)
- Still “not perfect,” but substantially better than other top models in the creator’s comparison
Demo 6: Music composition + DAW UI
Goal: Have GPT-5.5 code a DAW-like interface in standalone HTML.
Instruments
- piano, synth, pluck, strings, drums, bass
Each instrument includes
- piano roll editor (drag/draw notes on timeline)
- play/pause and controls
Outcome
- Produced a “professional” 28-bar song attempt
Debugging
- Piano roll not appearing initially
- Alignment issues with the playhead; fixed with auto-panning / playhead visibility
Limitation
- Sounds are synthetic; suggests exporting MIDI into a better-sounding DAW for production.
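The playhead-alignment bug mentioned above comes down to bar/beat-to-seconds arithmetic. A minimal sketch of that timing math, assuming 4/4 time and a BPM setting (not the demo’s actual code):

```javascript
// Convert a piano-roll position (bar, beat) to playback time in seconds,
// the arithmetic a playhead and auto-panning need. Assumes 4/4 and a
// BPM value; illustrative sketch only.
function noteStartSeconds(bar, beat, bpm, beatsPerBar = 4) {
  const secondsPerBeat = 60 / bpm;
  return (bar * beatsPerBar + beat) * secondsPerBeat;
}

// A 28-bar song at 120 BPM lasts 28 * 4 * 0.5 = 56 seconds:
const songLength = noteStartSeconds(28, 0, 120);
```

Exporting MIDI, as the creator suggests, preserves exactly these (bar, beat, note) triples so a real DAW can replace the synthetic sounds.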
Demo 7: 3D shooter game (Three.js)
Goal: Build a functional 3D game using Three.js.
Genre / gameplay
- Futuristic battlefield
- Player controls a mecha warrior
- Enemies: waves of alien creatures from sky and ground
Must include
- third-person shooter perspective
- publicly available 3D assets
- AAA-like aim/UX
Iterations
- Adjust camera/view so aim icon isn’t blocked
- Fix aiming/shooting alignment issues
Result claimed
- Fully functional 3D shooter
- Multiple waves/levels
- Said to be largely functional from the start, needing only a couple of prompt fixes
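A wave-based spawner like the one described (“waves of alien creatures from sky and ground”) can be sketched as follows; the counts and scaling constants are invented for illustration, not taken from the game:

```javascript
// Minimal wave builder: each wave spawns more, tougher enemies, split
// between "sky" and "ground" spawn points. Constants are invented.
function buildWave(waveIndex, baseCount = 4) {
  const count = baseCount + waveIndex * 2; // ramp difficulty per wave
  return Array.from({ length: count }, (_, i) => ({
    origin: i % 2 === 0 ? "ground" : "sky",
    hp: 10 + waveIndex * 5,
  }));
}
```

In a Three.js game each entry would then be instantiated as a mesh at a spawn point matching its `origin`.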
Demo 8: “Frog test” (hidden object detection)
- Task: find and circle a hidden frog in an image with a “think deeply, one chance” instruction.
- Outcome: incorrect circled location.
- Takeaway: models can still fail on certain “hidden object” reasoning tasks.
Agentic automation example: scraping leads + generating landing pages
Goal: Demonstrate automated business lead generation using Codeex agents.
- Search: roofing companies in California (limit: 3)
- Requirements:
- must have email
- must not have a website
- Actions:
- scrape emails
- create a standalone HTML landing page per company using online logos/photos/info
Result claimed (~3 minutes)
- scraped lead emails
- generated individual landing pages
- landing pages included email + a “call now” button using phone info
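The lead-qualification rule described above (“must have email, must not have a website, limit 3”) is a simple filter over scraped records. A sketch, with field names and sample data assumed:

```javascript
// Lead filter from the demo's rule: keep companies that list an email
// but have no website, capped at a limit. Field names are assumptions;
// the scraping itself is out of scope here.
function filterLeads(companies, limit = 3) {
  return companies.filter((c) => c.email && !c.website).slice(0, limit);
}

// Example scraped records (invented):
const scraped = [
  { name: "A Roofing", email: "a@example.com", website: "https://a.example" },
  { name: "B Roofing", email: "b@example.com", website: null },
  { name: "C Roofing", email: null, website: null },
  { name: "D Roofing", email: "d@example.com" },
  { name: "E Roofing", email: "e@example.com" },
];
const leads = filterLeads(scraped);
```

Each surviving record then seeds one generated landing page (logo, photos, email, and a “call now” link).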
“Deep research” capability in ChatGPT
Goal: Medical science synthesis task:
- Analyze Alzheimer’s disease therapy mechanisms
- Contrast therapies targeting each protein
- Critically appraise cognitive/imaging outcomes from recent Phase 3 trials
- Include relevant tables/visualizations and citations
Result claimed
- The model’s “thinking” phase took several minutes
- Output included executive synthesis, multiple sections, citations, multiple tables, and a written flowchart
- Described as concise and professional with little filler
Hallucination + general reasoning sanity checks
“What does the S in ChatGPT stand for?”
- The model responded: “There is no S.”
- It reportedly held the line when prompted again.
Car wash test (walk vs drive 50m)
- Correctly advised to drive only if the goal is to bring the car to the car wash.
Benchmarks and spec claims (performance vs competitors)
The video compares GPT-5.5 against models like Claude Opus 4.7 and earlier GPT-5.4, using multiple leaderboards.
Claims
- The GPT-5.5 extra high and high variants rank highly, often taking the #1 or top spot.
Example cited benchmarks
- Terminal-Bench
- GPT-5.5 claimed to beat Claude by ~12 percentage points
- also uses fewer tokens
- Artificial Analysis (independent leaderboard)
- extra high/high both ranked #1
- context window claimed 922K tokens
- LiveBench (Abacus AI)
- extra high ranks #1 slightly above GPT-5.4
- ARC-AGI-2
- extra high highest scoring (~85%)
- described as testing emergent pattern learning in visual puzzles
Token / cost trade-off
- GPT-5.5 is claimed to be more expensive than GPT-5.4 despite using fewer tokens in some tests.
Hallucination benchmark warning
- The video reports a benchmark where GPT-5.5 hallucinates ~86% of the time (creator notes this is benchmark-specific, not necessarily “86% always”).
- Implication / recommendation:
- If factual accuracy is critical (e.g., medical research or law), GPT-5.5 may not be best without strong verification.
Main speakers / sources
- Speaker: The YouTube creator/reviewer narrating the demos and comparisons (name not given in subtitles).
- Sponsors / mentions:
- HubSpot
- Codeex (tool/platform used for agentic coding demos)
- ChatGPT / OpenAI (model provider being reviewed)
- Benchmarks / leaderboards referenced:
- Artificial Analysis
- Abacus AI
- ARC AGI / ARC-AGI-2
Category
Technology