Categories
AI Learn

What is prompt caching, and why is it the single biggest lever on your AI bill in 2026

Prompt caching is the single biggest lever on your 2026 AI bill, a 5x to 10x cost cut that most teams still aren’t using correctly. Here’s how it works on Anthropic, OpenAI, and Google, plus the two ways teams blow the savings.

If you’re running anything on top of a large language model in 2026, an agent, a RAG pipeline, a customer-support bot, a code assistant, and you aren’t using prompt caching, you are probably paying between 5× and 10× more than you need to. That’s not a marketing claim; it’s what the pricing tables on Anthropic’s, OpenAI’s, and Google’s own API docs say when you do the math.

This post is a plain-English walkthrough of what prompt caching is, how it actually works under the hood, which provider charges what, and the two or three ways teams still get it wrong. No prior ML background assumed.

The short answer

Prompt caching is a way to tell an LLM API that the first several thousand tokens of a request will be the same every time, so it should not re-read them from scratch but pick up from where it left off. The API does, and it charges you roughly 10% of the normal input-token price for the re-read portion. Anthropic gives the deepest discount (90% off reads), Google gives 75% off reads, OpenAI gives 50% off reads, and Google and OpenAI do it automatically for any prompt long enough to qualify.

In return, the first request (the “write”) costs a bit more than normal, 25% extra on Anthropic for a 5-minute cache, 100% extra for a 1-hour cache. You break even on Anthropic’s 5-minute cache after one hit, and on the 1-hour cache after two.

If your prompts reuse any significant prefix, a system prompt, a RAG context, tool schemas, few-shot examples, you should be caching. If they don’t, you shouldn’t. That’s really it.

The long answer

What the model is actually doing

A transformer-based LLM processes your prompt by turning each token into a vector, then running that vector through dozens of attention layers. At each layer, the model computes two big matrices, called Keys and Values, or K and V, that encode how every token in the prompt relates to every other token. Generating even a single new output token requires the K and V matrices for the entire prompt.

Normally, every API request recomputes K and V from scratch. That’s the expensive part, the matrix multiplications scale with the square of the prompt length, which is why long prompts are disproportionately slow and costly.

Prompt caching changes the contract. On the first request, the provider computes K and V for your prompt as usual, but also saves them in memory, keyed to the exact sequence of tokens that produced them. On subsequent requests that start with the same prefix, the provider jumps straight to the cached matrices and only computes fresh K and V for the tokens that come after the cache boundary. You pay for the skipped computation at a steep discount, because the cost to the provider is essentially zero, it’s just reading back memory it already had.

That’s why the discount is 90% rather than 100%: there’s still memory bandwidth, occasional cache misses where the entry was evicted, and some housekeeping. But it’s close to free.

The “bookmark” analogy

If the above was too dense, here’s the version that holds up:

Imagine you’re reading a dense reference book to answer questions for a stream of callers. The first caller asks something, and you read the whole introduction, slow, thorough, expensive. But now the introduction is loaded in your head. When the second caller asks a different question, you don’t reread the intro; you bookmark where you stopped and jump straight to their question, using the context you already have. The bookmark costs something to place, you had to pay attention to create it, but every subsequent jump is 90% cheaper than re-reading.

Prompt caching is the bookmark. The static prefix is the introduction. Your dynamic question is whatever comes after the bookmark.

What each provider charges in April 2026

Anthropic (Claude Sonnet 4.6 at $3 per million input tokens as the baseline):
– Write cost (5-min cache): 1.25× base = $3.75/MTok
– Write cost (1-hour cache): 2.0× base = $6.00/MTok
– Read cost: 0.10× base = $0.30/MTok (90% discount)
– Requires explicit cache_control on the prompt block
– Up to 4 cache breakpoints per request
– Minimum useful size: ~1,024 tokens

OpenAI (GPT-4 class, automatic):
– Write cost: same as normal input price (no premium)
– Read cost: 50% of input price
– Fully automatic for prompts ≥1,024 tokens, no code changes required
– Cache TTL: 5–10 minutes, not configurable

Google Gemini:
– Write cost: free to cache
– Read cost: 25% of input price (75% discount, most aggressive)
– Storage fees: charged per hour cached
– Minimum prompt size: 32,768 tokens (highest barrier to entry)
– Default TTL: 1 hour, configurable up to 24 hours

Three very different philosophies. Anthropic treats caching as an explicit optimization with the deepest payoff and the most knobs. OpenAI treats it as a transparent efficiency gain users shouldn’t have to think about. Google treats it as an explicit long-context service with the steepest read discount but a 32K minimum that rules out most conversational use.

When caching helps vs. when it hurts

Caching is a win when:
– Your system prompt is long and doesn’t change between requests (agents, RAG systems, persona-heavy chatbots)
– You run many-shot examples or tool schemas as part of every call
– You batch-process documents against the same instructions
– You hold code files in context across a coding session

Caching hurts, or does nothing, when:
– Your prompts are short (below the minimum threshold)
– Every request has a fundamentally different prefix
– You make requests infrequently enough that the cache expires between hits (Anthropic 5-min, OpenAI 5–10 min)
– Your workload is output-heavy with short inputs, the caching discount only applies to input tokens

Why this matters in 2026

Agentic AI is the reason. Every agent loop, whether it’s Claude Code iterating on a repo, a GPT-based research assistant walking through a browser, or a Gemini pipeline summarizing ten documents at once, shares a huge fixed prefix (system prompt, tool definitions, often a long document context) across dozens of model calls. Without caching, you pay full freight for that prefix on every turn. With caching, you pay it once per session plus 10% per turn thereafter. At typical agent loop lengths (10–50 turns), caching is the difference between an application that’s economical and one that isn’t.

The other reason is that the default cost ceiling is rising. As frontier models get more expensive per token (Anthropic’s Mythos Preview preview pricing is 3× Opus 4.6), the gap between the cached and uncached price grows in absolute terms. Teams that adopted caching in 2024 got a 5× discount on a cheap model. Teams adopting in 2026 are getting a 10× discount on a much more expensive one.

Common misconceptions

  1. All cached tokens cost 90% less. Only the read tokens. Writes cost extra (25–100% on Anthropic). If you never get a cache hit, one-off requests, infrequent workloads, prefixes that always change, caching makes your bill higher.

  2. Cache everything with the longest TTL. Longer TTLs cost more up front. Anthropic’s 1-hour cache doubles your write cost. If your reuse pattern is under 5 minutes, don’t pay for an hour.

  3. Placement doesn’t matter. Where you put the cache breakpoint is the whole game. If your static system prompt is followed by a timestamp, a user ID, or any dynamic content before the breakpoint, the cache key changes on every request and you never hit. Rule: stable content first, dynamic content last, cache breakpoint between them.

Where to learn more

Sources