
Anthropic Prompt Cache TTL + Cost Mechanics


Updated May 6, 2026

Anthropic silently dropped Claude Code’s prompt-cache TTL from 1 hour to 5 minutes around early March 2026 (issue #46829). If you aren’t aware of this, any idle gap of 5 minutes or more between messages evaporates the cache and forces a full cold cache write on the next message, priced at 1.25× base input on the entire conversation prefix (system prompt + tools + CLAUDE.md + every prior turn). On a 200K-token Opus session that’s ~\$1.25 per resume; across a working day this can raise per-session cost 30–60%.

Pre-regression, sessions idle for up to an hour between messages stayed warm. Post-regression, walking-away patterns (lunch, meetings, focus blocks) cost real money, and many users didn’t notice because there was no announcement, no release-note line, no banner.

Cache mechanics, verified

TTL options

| TTL | Default? | Refresh behavior |
| --- | --- | --- |
| 5 min | YES (post-2026-03 regression) | Each cache hit resets the timer (sliding window). Active sessions stay warm forever. |
| 1 h | Opt-in via `cache_control: { ttl: "1h" }` on API requests | Same sliding-window behavior, longer dead-clock. NOT user-selectable in Claude Code today. |
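At the raw API layer, the 1h TTL is opted into per cache breakpoint. Below is a sketch of what such a request body looks like; the model id is a placeholder, and the exact `ttl` syntax (plus any required beta header) should be verified against Anthropic's current API reference before relying on it.

```python
import json

def build_request(system_text: str, user_text: str) -> dict:
    """Build a Messages API request body with a 1h-TTL cache breakpoint
    on the system prompt. (Sketch; verify fields against current docs.)"""
    return {
        "model": "claude-opus-4",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # 1h entries must appear before any 5m entries in the prefix;
                # the 1h TTL may also require an extended-TTL beta header.
                "cache_control": {"type": "ephemeral", "ttl": "1h"},
            }
        ],
        "messages": [{"role": "user", "content": user_text}],
    }

body = build_request("You are a code reviewer.", "Review this diff.")
print(json.dumps(body["system"][0]["cache_control"]))
```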

Pricing multipliers (vs base input)

| Operation | Multiplier vs base input |
| --- | --- |
| Cache read (hit) | 0.10× (10%) |
| Cache refresh on hit | 0.10× (same as read) |
| 5m cache write (cold/miss) | 1.25× |
| 1h cache write (cold/miss) | 2.00× |

Claude Opus 4.7 numbers: base input \$5/MTok, 5m write \$6.25, 1h write \$10, read/refresh \$0.50, output \$25.

What “5m vs 1h is 2x more expensive” actually means

The “2×” claim circulating in user discussions compares 1h cache write to uncached base input (2.0× vs 1.0×). The ratio of 1h write to 5m write is 1.6× (2.0 / 1.25). Both are correct depending on the comparison frame; the “2×” framing only makes sense vs uncached.
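The arithmetic behind both framings, using the Opus figures above:

```python
BASE = 5.00  # $/MTok, Opus base input (figure from the text)

MULTIPLIERS = {
    "cache_read": 0.10,  # hit or refresh
    "write_5m":   1.25,  # cold write, 5-minute TTL
    "write_1h":   2.00,  # cold write, 1-hour TTL
}

# Derive the $/MTok prices from the multipliers:
prices = {op: round(BASE * m, 2) for op, m in MULTIPLIERS.items()}
print(prices)  # {'cache_read': 0.5, 'write_5m': 6.25, 'write_1h': 10.0}

# The two comparison frames for the "2x" claim:
print(MULTIPLIERS["write_1h"] / 1.0)                      # vs uncached: 2.0
print(MULTIPLIERS["write_1h"] / MULTIPLIERS["write_5m"])  # vs 5m write: 1.6
```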

What invalidates the cache

The cache key is a hash of the full prefix in order: system prompt + tool definitions + CLAUDE.md + conversation history. Changing any portion invalidates everything from that point onward.

| Change | Effect |
| --- | --- |
| Edit CLAUDE.md mid-session | Prefix changes → all cache dies → every subsequent message reprocessed |
| Add/remove MCP server mid-session | Tool defs change → full invalidation. Claude Code’s design locks the tool list at startup to prevent this. |
| Switch model (Opus ↔ Sonnet) | Different model = different cache. `tool_choice` changes also invalidate. |
| Timestamp / dynamic content in system prompt | Prefix differs every turn → never hits |
| /compact | Safe — Claude Code rebuilds the conversation summary AFTER the same cached prefix (system + tools + CLAUDE.md). Prefix reuse is intentional. |
| /clear | Wipes session; next message is cold |
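A toy model of why one edit cascades: treat each breakpoint's cache key as a hash of everything before it (illustrative only; the real keying is internal to Anthropic's API):

```python
import hashlib

def prefix_keys(blocks: list[str]) -> list[str]:
    """Hash the prefix incrementally: the key at each block covers
    every block before it, so an early change alters all later keys."""
    h = hashlib.sha256()
    keys = []
    for block in blocks:
        h.update(block.encode())
        keys.append(h.copy().hexdigest()[:8])
    return keys

before = prefix_keys(["system", "tools", "CLAUDE.md v1", "turn 1", "turn 2"])
after  = prefix_keys(["system", "tools", "CLAUDE.md v2", "turn 1", "turn 2"])

# Keys agree up to the edit, then diverge for every downstream block:
print([a == b for a, b in zip(before, after)])  # [True, True, False, False, False]
```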

Why Claude Code’s design leans so hard on cache

Per Thariq Shihipar (Claude Code engineer) — prompt caching is the architectural constraint around which the product is built. They declare SEVs when cache hit rates drop. Concrete design choices that exist for cache reasons:

  1. Tool list locked at session start. Adding an MCP tool mid-session would change the prefix → invalidates everything. Claude Code refuses to register new tools after startup.
  2. Plan mode adds tools, never swaps. When plan mode was built, the obvious design was “swap to read-only tools.” Cache-aware design: keep ALL tools in the prompt always; add EnterPlanMode and ExitPlanMode as additional tools; send mode change as a user message. Tool defs never change between plan mode and normal mode.
  3. Compaction is a fork, not a rebuild. Compaction request uses the identical prefix as your current conversation (same system prompt, tools, CLAUDE.md). Only the messages portion gets summarized. Prefix KV cache is reused.

Without prompt caching, a 100-turn Opus coding session can cost \$50–\$100 in input tokens. With a 90% hit rate, ~\$10–\$19. These economics are why Claude Code Pro (\$20/mo) is viable.
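A rough sketch of those session economics, under a simplified blended-rate model (every turn reprocesses the full prefix; hits read at 0.10×, misses rewrite at 1.25×; prefix size is an illustrative assumption):

```python
BASE_IN = 5.00  # $/MTok, Opus base input (figure from the text)

def session_input_cost(turns: int, avg_prefix_mtok: float,
                       hit_rate: float = 0.0) -> float:
    """Blended-rate model: each turn reprocesses the full prefix.
    hit_rate=0.0 means caching is off entirely (1.0x base input)."""
    if hit_rate > 0:
        rate = hit_rate * 0.10 + (1 - hit_rate) * 1.25  # reads vs cold writes
    else:
        rate = 1.0  # uncached baseline
    return turns * avg_prefix_mtok * BASE_IN * rate

# 100 turns over an average 150K-token prefix (illustrative numbers):
print(f"uncached:     ${session_input_cost(100, 0.15):.2f}")
print(f"90% hit rate: ${session_input_cost(100, 0.15, 0.90):.2f}")
```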

Cost math for an Opus 4.7 200K-token prefix

| Scenario | Cost |
| --- | --- |
| Cold write on resume after 5-min idle | 200K × \$6.25/MTok = \$1.25 |
| Subsequent in-window message | 200K × ~\$0.50/MTok = \$0.10 |
| 12 pings/hr (cache-keepalive idle) | 12 × ~\$0.10 = \$1.20/hr (if you used a keepalive) |
| 10-resume day without keepalive | ~\$12.50 just in cold-write tax |
| Same session w/ cache-keepalive | ~\$5/day (continuous warm) |
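Sanity-checking the first rows of that math in a few lines:

```python
PREFIX_MTOK = 0.2  # 200K-token prefix

WRITE_5M = 6.25    # $/MTok, 5m cold write (Opus)
READ     = 0.50    # $/MTok, cache read

cold_resume = PREFIX_MTOK * WRITE_5M  # one resume after a >=5 min idle gap
warm_turn   = PREFIX_MTOK * READ      # one in-window message

print(f"cold resume: ${cold_resume:.2f}")       # $1.25
print(f"warm turn:   ${warm_turn:.2f}")         # $0.10
print(f"10 resumes:  ${10 * cold_resume:.2f}")  # $12.50
```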

For Pro/Max subscribers, cache misses don’t bill dollars (flat fee), but they do consume rate limit (the 5-hour and weekly Opus quotas). A high cache-miss rate burns your quota faster.

Cost levers user-controllable in Claude Code

| Lever | Action |
| --- | --- |
| Keep sessions active | Don’t let the cache expire mid-task. Type a filler turn before a known break, OR install cache-keepalive (see related). |
| Slim CLAUDE.md | Loaded at session start → sits in the cached prefix forever. Move workflow detail to skills, which lazy-load on invoke. |
| Lock MCP servers up front | Don’t toggle servers mid-session. Configure .mcp.json once. |
| Pin model per session | Don’t switch Opus ↔ Sonnet inside one task. |
| Subagents for verbose ops | Heavy file reads / log dumps → subagent. Verbose tokens stay in the subagent context; only the summary returns. |
| /compact is fine | Designed cache-aware. Use freely when context fills. |
| Monitor /usage | Cache hit ratio below 90% = something is invalidating the prefix. Investigate. |
| Avoid agent teams unless needed | ~7× tokens vs a solo session (each teammate has its own context). |
| Start new session for unrelated tasks | Stale conversation = bigger cache write each turn, even at a 90% hit rate. |
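The "keep sessions active" lever reduces to the sliding window: any hit inside the TTL resets the clock, so a keepalive only needs to fire more often than every 5 minutes. A toy model of that behavior:

```python
TTL_SECONDS = 5 * 60     # post-regression sliding window
PING_INTERVAL = 4 * 60   # a keepalive firing safely inside the window

def cache_is_warm(last_hit_ts: float, now: float) -> bool:
    """Sliding-window model: any hit within the TTL keeps the prefix
    warm and resets the clock. (Toy model of the behavior above.)"""
    return (now - last_hit_ts) < TTL_SECONDS

# Back-to-back messages stay warm; a full 5-minute gap goes cold:
print(cache_is_warm(0, 240))  # True  (4 min gap)
print(cache_is_warm(0, 300))  # False (5 min gap, cache expired)
```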

Mixing TTLs at the API layer

Both 1h and 5m can coexist in a single API request. Constraint: 1h cache entries must appear before any 5m entries in the prefix. Billing partitions into three positions: A (highest cache hit), B (highest 1h breakpoint after A), C (last cache breakpoint). Charged: read for A, 1h write for (B - A), 5m write for (C - B).
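That three-position billing split can be sketched with the multipliers above (positions as cumulative MTok counts; a hypothetical helper, not an official formula):

```python
BASE = 5.00  # $/MTok base input (Opus figure from the text)

def mixed_ttl_cost(a_mtok: float, b_mtok: float, c_mtok: float) -> float:
    """a <= b <= c are cumulative prefix sizes in MTok:
    A = longest cached hit, B = last 1h breakpoint, C = last breakpoint.
    (Hypothetical helper modeling the billing split described above.)"""
    read = a_mtok * BASE * 0.10             # position A: cache read
    w1h  = (b_mtok - a_mtok) * BASE * 2.00  # (B - A): 1h write
    w5m  = (c_mtok - b_mtok) * BASE * 1.25  # (C - B): 5m write
    return read + w1h + w5m

# 100K already cached, next 50K written at 1h, final 50K at 5m:
print(round(mixed_ttl_cost(0.10, 0.15, 0.20), 4))  # 0.05 + 0.50 + 0.3125 = 0.8625
```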

For Claude Code users this is moot — Anthropic chose 5m default and exposes no flag to flip TTL. API users can set ttl: "1h" per breakpoint.

When cache awareness does not matter

  • One-off short sessions (under 30K tokens, single message exchange).
  • Sessions where prefix invalidation is unavoidable (rapid CLAUDE.md iteration, MCP debugging).
  • Pro/Max subscribers who never approach rate limits — cache misses cost rate-limit only, not \$.

Practical takeaway

The cache TTL regression is silent and the cost is real — \$1.25 per cold-write on a 200K-token Opus session, multiplied by however many idle gaps fall over 5 minutes. The user-controllable levers are: keep sessions active (filler turn or a keepalive), slim CLAUDE.md (move workflow detail to lazy-loaded skills), lock MCP servers up front, pin the model per session, and use subagents for verbose ops so heavy tokens never enter the cached prefix. /compact is cache-safe by design — use it. Watch /usage for cache hit ratios below 90%; that’s the signal that something in your setup is invalidating the prefix.
