Summary of "Как НИКОГДА не упираться в лимиты | Claude и Codex" ("How to NEVER Hit the Limits | Claude and Codex")
Main Topic
A guide on how not to hit context/usage limits in chat-based AI coding assistants (Claude, Claude Code, Codex, Cursor, etc.) by reducing token consumption and optimizing chat/workflows.
Key Technological Concepts Explained
Tokens
- Tokens are the basic units of text consumed by the model; every character/punctuation counts.
- Approximate rule of thumb: one token is about 3/4 of an English word.
- Language note: Russian is less efficient (1 Russian word ≈ 2–3 tokens), so Russian text can cost ~2–3× more tokens.
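The rules of thumb above can be turned into a rough estimator. This is a minimal sketch, not a real tokenizer: the function name `estimate_tokens` is hypothetical, and the ratios are assumptions derived from the figures quoted in the summary (about 4/3 tokens per English word, and 2.5 as a midpoint of the claimed 2–3 tokens per Russian word).

```python
import re

# Assumed ratios taken from the rules of thumb above, not measured tokenizer output.
TOKENS_PER_WORD = {
    "en": 4 / 3,  # one token is roughly 3/4 of an English word
    "ru": 2.5,    # midpoint of the claimed 2-3 tokens per Russian word
}


def estimate_tokens(text: str, lang: str = "en") -> int:
    """Return a rough token estimate based on word count alone."""
    words = re.findall(r"\w+", text)
    return round(len(words) * TOKENS_PER_WORD[lang])


english = "Fix the login bug and add a test"
russian = "Исправь ошибку входа и добавь тест"
print(estimate_tokens(english, "en"))  # 11
print(estimate_tokens(russian, "ru"))  # 15
```

For exact numbers, a provider's own token counter is the better source; the estimate is only useful for comparing prompts against each other.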
Context window / memory in a chat
- The model must re-read all prior messages every time it generates a response.
- In long chats, most tokens are spent on re-processing history, not producing the new answer.
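A small simulation illustrates why history dominates in long chats. It is a sketch under simplifying assumptions: every turn has the same hypothetical message sizes, the full history is re-sent on each turn, and prompt caching is ignored. The function name `history_share` is made up for this example.

```python
def history_share(turns: int, user_tokens: int, reply_tokens: int) -> float:
    """Fraction of all input tokens that was re-read history rather than new text."""
    history = 0  # tokens already in the conversation
    reread = 0   # history tokens re-sent across all turns
    fresh = 0    # tokens of genuinely new user messages
    for _ in range(turns):
        reread += history                     # the whole history is sent again
        fresh += user_tokens                  # plus the new message
        history += user_tokens + reply_tokens
    return reread / (reread + fresh)


# Hypothetical sizes: 50-token questions, 300-token answers.
for n in (2, 10, 100):
    print(n, f"{history_share(n, 50, 300):.1%}")
```

Under these assumptions the share of re-read history rises quickly with the number of turns, which matches the direction of the claim even though the exact percentages depend on the message sizes.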
Context degradation
- As the chat grows, the model can “forget” earlier details and give less precise answers.
- The video describes the model getting “dull” at around 100 messages.
Token cost asymmetry
- Input tokens are cheaper than output tokens (claim: output can be ~5× more expensive).
- Because the model generates output token by token and re-processes the text it has already generated, long answers consume disproportionately more of the limit.
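The asymmetry can be expressed as a weighted cost. This sketch assumes the ~5× ratio claimed above; real ratios vary by model and provider, and `weighted_cost` is a hypothetical helper, not part of any API.

```python
OUTPUT_WEIGHT = 5  # assumed: one output token costs about as much as 5 input tokens


def weighted_cost(input_tokens: int, output_tokens: int) -> int:
    """Cost in input-token equivalents under the assumed output weight."""
    return input_tokens + OUTPUT_WEIGHT * output_tokens


# Same prompt, verbose answer versus terse answer (hypothetical sizes).
verbose = weighted_cost(input_tokens=2_000, output_tokens=1_500)
terse = weighted_cost(input_tokens=2_000, output_tokens=300)
print(verbose, terse)  # 9500 3500
```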
Practical “Life Hacks” and Product Features (How-To)
Don’t ask unnecessary questions
- After an incorrect answer, avoid follow-up messages that only complain or re-ask the same thing; each one adds to the history and burns tokens faster.
Use rollback features in Claude Code / Codex / Cursor
- If the agent’s edits are a mess, use rollback instead of continuing from a bad state.
- Rollback modes (conceptually):
- Rewind conversation: rolls back dialogue history to a point.
- Rewind code: rolls back code/files to an earlier point while keeping chat history.
- Full rollback / fork conversation: forks the chat and restores code + conversation from before the mistake.
- Goal: reduce wasted context and avoid forcing the model to redo everything.
Edit earlier messages instead of re-asking
- Example: change “Tula” → “Irkutsk” via an inline pencil edit so the assistant rewrites without expanding the dialogue.
Ask multiple questions at once (when appropriate)
- Bundling reduces how often the model must re-run long chat history.
- Caveat: not for very large tasks.
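The saving from bundling can be sketched with the same re-sent-history model. All sizes are hypothetical, both function names are invented for illustration, and caching is again ignored.

```python
def input_tokens_sequential(history: int, questions: list[int], reply: int) -> int:
    """Each question is its own turn, so the growing history is re-sent every time."""
    total = 0
    for q in questions:
        total += history + q
        history += q + reply
    return total


def input_tokens_bundled(history: int, questions: list[int]) -> int:
    """All questions go in one message, so the history is sent once."""
    return history + sum(questions)


# A 40k-token chat, three 30-token questions, 400-token replies.
print(input_tokens_sequential(40_000, [30, 30, 30], reply=400))  # 121380
print(input_tokens_bundled(40_000, [30, 30, 30]))                # 40090
```

The gap comes almost entirely from re-sending the existing history, so bundling matters most late in a long chat.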
Use a “listen/side question” style (non-blocking queries)
- In Codex-like terminals: start a task, then use a “listen” mode (or similar) to ask a clarifying question without interrupting the main generation thread.
- Note: after using Escape, both the main and side content are removed from the visible chat (as described).
Force short, direct answers via settings
- Configure Claude/Codex to respond without fluff to reduce output token usage.
Create new chats periodically
- Don’t cram a full multi-stage project into one conversation.
- Recommended structure: one big task per chat; finish it, then start a new chat for the next phase.
- Use Markdown instruction files when you need context carried forward.
Use subagents for heavy analysis
- For tasks like an SEO audit, delegate the deep analysis to a subagent; only the final summary returns to the main chat.
- This is described as available in both Claude and Codex environments.
- Subagents/agents are positioned as instructions defining model, tools, and task scope.
Use Skills (Markdown-based instructions)
- Create reusable “skill” instructions for recurring tasks so the model doesn’t re-derive the procedure each time.
- Skills/agents only load their title + description until needed, reducing baseline context load.
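The effect of loading only title and description can be shown with a toy model. This does not reflect how any product actually stores or loads skills: the `Skill` class, both functions, and the use of word counts as a stand-in for tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    title: str
    description: str
    body: str  # full instructions, only needed when the skill is invoked


def words(text: str) -> int:
    """Word count, used here as a crude proxy for tokens."""
    return len(text.split())


def baseline_cost(skills: list[Skill]) -> int:
    """Words always present in context: titles and descriptions only."""
    return sum(words(s.title) + words(s.description) for s in skills)


def cost_when_invoked(skills: list[Skill], used: str) -> int:
    """Baseline plus the full body of the one skill actually used."""
    body = next(s.body for s in skills if s.title == used)
    return baseline_cost(skills) + words(body)


skills = [
    Skill("seo-audit", "Audit a page for SEO issues", "step " * 500),
    Skill("release-notes", "Draft release notes from commits", "step " * 300),
]
print(baseline_cost(skills))                   # 13
print(cost_when_invoked(skills, "seo-audit"))  # 513
```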
Stay under a token budget target (~120k)
- Even if the context window is large (e.g., up to ~1M tokens), the practical recommendation is limiting usage to around 120,000 tokens due to ongoing history reprocessing and slowdown/context degradation.
- An update is mentioned that increased context window size, but the degradation/slowdown issue is argued to remain.
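As a rough planning aid, the budget can be converted into a number of turns. This is a sketch that assumes a constant, hypothetical number of tokens added per turn; `turns_until_budget` and its parameters are illustrative names, and 120,000 is the video's suggested ceiling rather than a hard product limit.

```python
BUDGET = 120_000  # practical ceiling suggested above, not a hard product limit


def turns_until_budget(tokens_per_turn: int, budget: int = BUDGET, preloaded: int = 0) -> int:
    """Number of full turns that fit before the context reaches the budget.

    preloaded covers tokens present from the start (instruction files, tool definitions).
    """
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    return max(0, (budget - preloaded) // tokens_per_turn)


print(turns_until_budget(1_500))                    # 80
print(turns_until_budget(1_500, preloaded=30_000))  # 60
```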
Convert documents/files to Markdown
- Don’t feed PDFs/Word/HTML directly when possible.
- Reason: PDFs/Word include lots of non-text metadata (font info, coordinates, layout, images, profiles), which becomes extra tokens.
- Markdown is treated as “native”/preferred and can drastically reduce token footprint (example given: 15M tokens vs 8k tokens).
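For HTML, a basic conversion can be done with the Python standard library alone. The sketch below is deliberately minimal and handles only headings, paragraphs, and list items; `MarkdownExtractor` and `html_to_markdown` are names invented here. PDFs and Word files need dedicated conversion tools, which are not shown.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Keeps headings, paragraphs and list items; drops scripts, styles and attributes."""

    PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- ", "p": ""}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.lines = []       # finished Markdown lines
        self._buf = []        # text fragments of the block being read
        self._tag = None      # block-level tag currently open, if any
        self._skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.PREFIX:
            self._tag, self._buf = tag, []

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == self._tag:
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.lines.append(self.PREFIX[tag] + text)
            self._tag = None

    def handle_data(self, data):
        if self._tag and not self._skip_depth:
            self._buf.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


page = '<h1 class="t">Title</h1><p style="x">Hello <b>world</b></p><ul><li>one</li><li>two</li></ul>'
print(html_to_markdown(page))
```

Inline formatting such as bold is flattened to plain text here, which is usually acceptable when the goal is only to cut the token footprint.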
Manage CLAUDE.md / AGENTS.md size (keep it small)
- These global/project instruction files are loaded at session start.
- If too large, they waste tokens immediately and can pollute context.
- Rule of thumb: keep them to ~200 lines, and move large blocks (e.g., design systems) into separate Markdown files referenced when needed.
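The line-count rule is easy to check automatically. A minimal sketch, assuming the instruction files sit in the current directory under the names `CLAUDE.md` and `AGENTS.md`; the helper `oversized_instruction_files` is hypothetical.

```python
from pathlib import Path

MAX_LINES = 200  # rule of thumb quoted above


def oversized_instruction_files(paths, max_lines=MAX_LINES):
    """Return (path, line_count) for each existing file longer than max_lines."""
    report = []
    for path in map(Path, paths):
        if not path.is_file():
            continue  # silently skip files that are not present
        count = len(path.read_text(encoding="utf-8").splitlines())
        if count > max_lines:
            report.append((str(path), count))
    return report


if __name__ == "__main__":
    for name, count in oversized_instruction_files(["CLAUDE.md", "AGENTS.md"]):
        print(f"{name}: {count} lines, consider moving sections to separate files")
```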
Disable unnecessary tools (MCPs/connectors/features)
- Extra tools consume tokens and increase latency simply by being available.
- Example: disable an MCP like a screenshot/UI-checker if not needed.
- Described as navigating to MCP list and pressing disable.
Pick lightweight models for lightweight tasks
- Use smaller models for routine edits (example tiers: Haiku/Sonnet/Opus).
- Bigger models cost more tokens/thinking; reserve them for complex work.
Turn on planning mode before execution
- Enable a plan-first workflow:
1) the agent reads inputs/specs
2) it generates a plan without changing code
3) you approve/adjust the plan
4) only then does it execute
- Goal: reduce major mistakes and rework, saving tokens.
Use a planning-focused plugin (“Superpowers”)
- A plugin (reviewed elsewhere by the speaker) that uses built-in skills for a structured workflow: brainstorming → planning → implementation → testing.
- Recommended mainly for larger tasks; considered overkill for trivial ones.
“Stretching” Usage Limits in Claude (5-Hour Window)
5-hour usage-limit window
- Claude meters usage in 5-hour windows (Pro/Max/Team/Enterprise plans are mentioned).
- The window starts from the first message sent.
- If you hit the limit early, the quota becomes available again only after the 5-hour window ends.
Workaround: Claude Routines
- Create a routine that sends a small trigger message (e.g., at 6:00 AM) so the 5-hour window starts before you begin working and resets earlier in your workday, giving you more working time with fewer interruptions.
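The timing argument can be checked with simple date arithmetic. This sketch assumes a simplified model in which a new window begins the moment the previous one ends (in practice it begins with the next message); `window_resets` and the 9:00 to 18:00 workday are illustrative assumptions.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=5)


def window_resets(first_message: datetime, day_end: datetime) -> list[datetime]:
    """Reset times up to day_end, assuming back-to-back 5-hour windows."""
    resets = []
    t = first_message + WINDOW
    while t <= day_end:
        resets.append(t)
        t += WINDOW
    return resets


day = datetime(2026, 1, 5)
end_of_work = day.replace(hour=18)

# First message at 9:00 when work starts: one reset during the workday (14:00).
print([t.strftime("%H:%M") for t in window_resets(day.replace(hour=9), end_of_work)])
# Trigger message at 6:00: two resets during the workday (11:00 and 16:00).
print([t.strftime("%H:%M") for t in window_resets(day.replace(hour=6), end_of_work)])
```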
Status update on limits (recent change described)
- On May 6, 2026, Anthropic reportedly expanded infrastructure (Memphis data center access) and doubled the 5-hour limits for Claude products (Pro/Max/Team/Enterprise).
- Also removed peak-hour restrictions and raised rate limits for Opus models, making it harder to hit limits.
Speaker’s Auxiliary Resources / Recommendations
- The speaker mentions creating full guides for Claude Code and Codex.
- Promotes their Telegram channel for additional tips, including legal/web development info and recurring AI news/podcast releases.
- Mentions they may add missing items to the Telegram channel.
Main Speakers / Sources
- Primary speaker: the video’s narrator/author (unnamed in subtitles).
- Referenced sources: unnamed developer(s) who measured token usage in long chats (example claim: ~98.5% of tokens spent on processing prior responses).
- Company referenced: Anthropic (Claude).
- Tools/products referenced: Claude, Claude Code, Codex, Cursor, plus MCPs, Agents/Subagents, Skills, Claude Routines.
Category
Technology