Summary of "GPT-5.5 is a total freak"
Summary of GPT-5.5 Video (tech demos, features, benchmarks)
The video reviews OpenAI’s GPT-5.5 as a major upgrade, emphasizing that it is more “performant” and better at agentic/automation workflows, especially in coding environments rather than just chat. The creator claims it makes fewer mistakes and runs more smoothly than the prior-generation model, while also noting that it can still struggle with high-stakes accuracy (medical imaging) and can hallucinate significantly on certain benchmarks.
Product / feature highlights
Key focus: agentic workflows
- GPT-5.5 is presented as “optimized” for agentic use in Codeex.
- Multiple agents can work across a folder-based project with iterative development, positioned as better than using ChatGPT’s single-chat interface.
Usage locations
- Available in the ChatGPT model selector (for paid tiers).
- Available in Codeex via the model dropdown.
Thinking effort / model variants
- The demo uses “thinking” modes (e.g., extra high and extended) to improve output quality/performance.
Demo 1: Interactive Earth “digital twin” (web 3D)
Goal: Generate an interactive 3D globe where users can zoom from space to city streets, with efficient browser loading.
Prompt requirements
- Realistic Earth rendering
- Publicly available assets/models/layers if needed
- Efficient load for a regular web browser
- City mapping on a 3D globe
- Layer toggles (e.g., nightlights, streets vs 3D buildings)
- Street view-like diving
Iteration / control shown
- Initial output: 3D buildings were “lackluster”
- Follow-up prompts:
- Improve building rendering quality
- Load buildings more efficiently
Result capabilities claimed
- Zoom/pan to specific locations (e.g., New York, San Francisco)
- Toggle night lights
- Render 3D buildings; optionally disable streets
- Provide a street view map view
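The video does not show the generated code, but the core arithmetic behind “zoom/pan to a specific location” on a 3D globe is a latitude/longitude-to-Cartesian mapping. A minimal sketch in plain JavaScript (function name and axis conventions are assumptions, not from the video):

```javascript
// Map latitude/longitude (degrees) to a 3D point on a sphere of radius r,
// the standard conversion a three.js-style globe uses to fly the camera
// to a city such as New York or San Francisco. Sketch only.
function latLonToVec3(latDeg, lonDeg, r = 1) {
  const lat = (latDeg * Math.PI) / 180;
  const lon = (lonDeg * Math.PI) / 180;
  return {
    x: r * Math.cos(lat) * Math.cos(lon),
    y: r * Math.sin(lat),
    z: r * Math.cos(lat) * Math.sin(lon),
  };
}

// New York is roughly 40.71 N, 74.01 W:
const ny = latLonToVec3(40.71, -74.01);
```

City markers and camera targets both reduce to this one conversion; toggleable layers (night lights, streets) are then textures swapped on the same sphere.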
Demo 2: Ray tracing simulator (adjustable materials)
Goal: Build a ray tracing simulation in standalone HTML using prompts.
Scene elements
- One sphere, cube, pyramid
- Blue sky + checkered ground
Material / physics parameters
- Sliders for position, reflectivity, roughness, translucency, color, and specular-style controls
Web performance instruction
- Prompt includes a keyphrase intended to make it run smoothly in-browser.
Iteration / control shown
- Initial prompt: sliders added for the sphere only
- Additional prompts: sliders added for other shapes
- Fix: consistent handling of reflections through translucent objects (reflected vs. seen-through appearance)
Result behavior claimed
- Adjusting reflectivity changes correct reflections between objects
- Translucency changes appearance from metallic reflective to near-transparent
- Cube/pyramid respect their parameter sliders and interactions
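The heart of any such simulator is a ray-primitive intersection test. As an illustrative sketch only (the demo’s actual code is not shown), here is the textbook ray-sphere intersection in plain JavaScript:

```javascript
// Ray-sphere intersection: returns the smallest positive t where
// origin + t * dir hits the sphere, or null on a miss. Standard
// quadratic-discriminant form; not the demo's actual code.
function intersectSphere(origin, dir, center, radius) {
  const oc = {
    x: origin.x - center.x,
    y: origin.y - center.y,
    z: origin.z - center.z,
  };
  const a = dir.x * dir.x + dir.y * dir.y + dir.z * dir.z;
  const b = 2 * (oc.x * dir.x + oc.y * dir.y + oc.z * dir.z);
  const c = oc.x * oc.x + oc.y * oc.y + oc.z * oc.z - radius * radius;
  const disc = b * b - 4 * a * c;
  if (disc < 0) return null; // ray misses the sphere
  const t = (-b - Math.sqrt(disc)) / (2 * a); // nearer of the two roots
  return t > 0 ? t : null;
}
```

The material sliders (reflectivity, roughness, translucency) decide what happens *after* a hit: whether the ray bounces, scatters, or continues through the surface.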
Demo 3: Medical image analysis (CTs)
The video tests GPT-5.5 image understanding on cancer identification.
A) Chest CT lesions (4 slices)
- Request: describe the photo and circle lesions
- Outcome: 3 of 4 slices correct
- Not perfect: one lesion was circled incorrectly, and some of the correct slices were visually obvious cases.
B) Brain tumor identification (6 images)
- Request: identify tumor types in each of six images
- Outcome: not fully correct
- Multiple misclassifications (e.g., “no tumor” where a tumor type was expected; wrong tumor labels in several positions).
- Conclusion: despite being “state-of-the-art,” it cannot reliably identify brain tumors from CT scans in this test.
Demo 4: Codeex “liquid splashes” lab with hand tracking
Goal: Create an interactive liquid splash simulator with adjustable physical/render parameters, controlled via webcam hand tracking.
Initial issues
- Lighting artifacts (“light flash every few seconds”)
- Cursor-following fake particles disliked
- Efficiency concerns for a web browser
Fixes / iterations
- Adjust lighting behavior (angle/intensity), background, performance
- Remove fake cursor particles
- Add more sliders controlling:
- splash size/force
- turbulence/flow speed
- color speed / saturation
- persistence (how long splashes remain)
- gravity direction & magnitude
- light angle and light power
- Add “enable hand” mode for webcam control
Result claimed
- “Fully functional” interface
- “Physically accurate” look
- Hand-controlled interaction smoother than other tested state-of-the-art models
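For context on what sliders like “gravity direction & magnitude” and “persistence” typically control, a per-particle update step might look like the following. This is a hypothetical sketch; the field and parameter names are assumptions, not taken from the demo:

```javascript
// One explicit-Euler update for a splash particle, driven by the kinds
// of slider parameters listed above. Hypothetical sketch only.
function stepParticle(p, params, dt) {
  const [gx, gy] = params.gravityDirection; // unit vector from a slider
  const g = params.gravityMagnitude;
  return {
    x: p.x + p.vx * dt,
    y: p.y + p.vy * dt,
    vx: p.vx + gx * g * dt,
    vy: p.vy + gy * g * dt,
    // "persistence" = seconds a splash remains; life runs from 1 to 0
    life: p.life - dt / params.persistence,
  };
}
```

Each slider then simply rewrites one field of `params`, which is why adding “more sliders” is cheap once the simulation loop exists.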
Demo 5: 3D scene generation from a complex office image
Goal: Convert a messy isometric office image into a detailed 3D animated scene via a single HTML file.
Iteration
- Early results were “lackluster”
- Prompts requested more detail and coherence
- Specific consistency fixes (e.g., ceiling lights attached to ropes, screens on monitors)
Result claimed
- Generates tables/chairs/monitors, plus plants/books/humans
- Includes animations (e.g., monitor screens)
- Still “not perfect,” but substantially better than other top models in the creator’s comparison
Demo 6: Music composition + DAW UI
Goal: Have GPT-5.5 code a DAW-like interface in standalone HTML.
Instruments
- piano, synth, pluck, strings, drums, bass
Each instrument includes
- piano roll editor (drag/draw notes on timeline)
- play/pause and controls
Outcome
- Produced a “professional” 28-bar song attempt
Debugging
- Piano roll not appearing initially
- Alignment issues with the playhead; fixed with auto-panning / playhead visibility
Limitation
- Sounds are synthetic; suggests exporting MIDI into a better-sounding DAW for production.
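The playhead-alignment bug mentioned above comes down to bar/beat-to-seconds arithmetic. A minimal sketch of that timing math, assuming 4/4 time and a BPM setting (not the demo’s actual code):

```javascript
// Convert a piano-roll position (bar, beat) to playback time in seconds,
// the arithmetic a playhead and auto-panning need. Assumes 4/4 and a
// BPM value; illustrative sketch only.
function noteStartSeconds(bar, beat, bpm, beatsPerBar = 4) {
  const secondsPerBeat = 60 / bpm;
  return (bar * beatsPerBar + beat) * secondsPerBeat;
}

// A 28-bar song at 120 BPM lasts 28 * 4 * 0.5 = 56 seconds:
const songLength = noteStartSeconds(28, 0, 120);
```

Exporting MIDI, as the creator suggests, preserves exactly these (bar, beat, note) triples so a real DAW can replace the synthetic sounds.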
Demo 7: 3D shooter game (Three.js)
Goal: Build a functional 3D game using Three.js.
Genre / gameplay
- Futuristic battlefield
- Player controls a mecha warrior
- Enemies: waves of alien creatures from sky and ground
Must include
- third-person shooter perspective
- publicly available 3D assets
- AAA-like aim/UX
Iterations
- Adjust camera/view so aim icon isn’t blocked
- Fix aiming/shooting alignment issues
Result claimed
- Fully functional 3D shooter
- Multiple waves/levels
- Said to be largely functional from the start, needing only a couple of prompt fixes
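A wave-based spawner like the one described (“waves of alien creatures from sky and ground”) can be sketched as follows; the counts and scaling constants are invented for illustration, not taken from the game:

```javascript
// Minimal wave builder: each wave spawns more, tougher enemies, split
// between "sky" and "ground" spawn points. Constants are invented.
function buildWave(waveIndex, baseCount = 4) {
  const count = baseCount + waveIndex * 2; // ramp difficulty per wave
  return Array.from({ length: count }, (_, i) => ({
    origin: i % 2 === 0 ? "ground" : "sky",
    hp: 10 + waveIndex * 5,
  }));
}
```

In a Three.js game each entry would then be instantiated as a mesh at a spawn point matching its `origin`.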
Demo 8: “Frog test” (hidden object detection)
- Task: find and circle a hidden frog in an image with a “think deeply, one chance” instruction.
- Outcome: incorrect circled location.
- Takeaway: models can still fail on certain “hidden object” reasoning tasks.
Agentic automation example: scraping leads + generating landing pages
Goal: Demonstrate automated business lead generation using Codeex agents.
- Search: roofing companies in California (limit: 3)
- Requirements:
- must have email
- must not have a website
- Actions:
- scrape emails
- create a standalone HTML landing page per company using online logos/photos/info
Result claimed (~3 minutes)
- scraped lead emails
- generated individual landing pages
- landing pages included email + a “call now” button using phone info
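The lead-qualification rule described above (“must have email, must not have a website, limit 3”) is a simple filter over scraped records. A sketch, with field names and sample data assumed:

```javascript
// Lead filter from the demo's rule: keep companies that list an email
// but have no website, capped at a limit. Field names are assumptions;
// the scraping itself is out of scope here.
function filterLeads(companies, limit = 3) {
  return companies.filter((c) => c.email && !c.website).slice(0, limit);
}

// Example scraped records (invented):
const scraped = [
  { name: "A Roofing", email: "a@example.com", website: "https://a.example" },
  { name: "B Roofing", email: "b@example.com", website: null },
  { name: "C Roofing", email: null, website: null },
  { name: "D Roofing", email: "d@example.com" },
  { name: "E Roofing", email: "e@example.com" },
];
const leads = filterLeads(scraped);
```

Each surviving record then seeds one generated landing page (logo, photos, email, and a “call now” link).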
“Deep research” capability in ChatGPT
Goal: Medical science synthesis task:
- Analyze Alzheimer’s disease therapy mechanisms
- Contrast therapies targeting each protein
- Critically appraise cognitive/imaging outcomes from recent Phase 3 trials
- Include relevant tables/visualizations and citations
Result claimed
- The model’s “thinking” phase took several minutes
- Output included executive synthesis, multiple sections, citations, multiple tables, and a written flowchart
- Described as concise and professional with little filler
Hallucination + general reasoning sanity checks
“What does the S in ChatGPT stand for?”
- The model responded: “There is no S.”
- It reportedly held the line when prompted again.
Car wash test (walk vs drive 50m)
- Correctly advised to drive only if the goal is to bring the car to the car wash.
Benchmarks and spec claims (performance vs competitors)
The video compares GPT-5.5 against models like Claude Opus 4.7 and earlier GPT-5.4, using multiple leaderboards.
Claims
- The GPT-5.5 extra high and high variants rank highly, often taking the #1 or top spot.
Example cited benchmarks
- Terminal-Bench
- GPT-5.5 claimed to beat Claude by ~12 percentage points
- also uses fewer tokens
- Artificial Analysis (independent leaderboard)
- extra high/high both ranked #1
- context window claimed 922K tokens
- LiveBench (Abacus AI)
- extra high ranks #1 slightly above GPT-5.4
- ARC-AGI-2
- extra high highest scoring (~85%)
- described as testing emergent pattern learning in visual puzzles
Token / cost trade-off
- GPT-5.5 is claimed to be more expensive than GPT-5.4 despite using fewer tokens in some tests.
Hallucination benchmark warning
- The video reports a benchmark where GPT-5.5 hallucinates ~86% of the time (creator notes this is benchmark-specific, not necessarily “86% always”).
- Implication / recommendation:
- If factual accuracy is critical (e.g., medical research or law), GPT-5.5 may not be best without strong verification.
Main speakers / sources
- Speaker: The YouTube creator/reviewer narrating the demos and comparisons (name not given in subtitles).
- Sponsors / mentions:
- HubSpot
- Codeex (tool/platform used for agentic coding demos)
- ChatGPT / OpenAI (model provider being reviewed)
- Benchmarks / leaderboards referenced:
- Artificial Analysis
- Abacus AI
- ARC AGI / ARC-AGI-2
Category
Technology