Let's say your LLM API bill just tripled, but you haven't changed a thing: same product, same features, same user count. What you are paying for is sending the same system prompt, few-shot examples, and context documents to the model thousands of times a day.
Every request starts from scratch. The model reads your 3,000-token system prompt, processes it, builds its internal representation, generates a response, and then throws all of that work away. The next request repeats the same system prompt, the same processing, and the same cost.
Multiply that by ten thousand requests a day, and you start to understand why teams are staring at invoices that grew faster than their revenue.
Prompt caching exists to fix this. But most people misunderstand what it does. They think it caches the answer. It does not.
It caches the question, or more precisely, the computational work the model does to understand your question before generating a response.
It is the difference between a feature you turn on and forget and a cost-optimisation strategy that requires you to rethink how you structure every prompt.
Let’s dig in!
What the Model Actually Does With Your Prompt
To understand prompt caching, you first need to understand what happens inside the model when it receives your input.
When an LLM processes a prompt, it does not simply "read" it the way a human reads a paragraph. It applies a series of mathematical transformations, called the attention mechanism, to every token. For each token, the model computes two things:
- a key, which represents what the token is about
- a value, which represents what information the token carries
These key-value pairs, called the KV cache, are what the model uses to understand the relationships between all the tokens in your prompt.
Think of it like a librarian cataloguing books.
The key is the catalogue entry (title, author, subject). The value is the book itself.
Once every book is catalogued, the librarian can answer questions quickly by cross-referencing the catalogue. But building it takes time and effort.
For a 3,000-token system prompt, the model builds 3,000 sets of key-value pairs across every layer of the network. That is a significant amount of computation.
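To make that concrete, here is a minimal numpy sketch of one layer's key/value projection, with toy dimensions standing in for a real model's:

```python
import numpy as np

n_tokens, d_model = 3000, 64  # toy sizes; real models use far larger dimensions

X = np.random.randn(n_tokens, d_model)   # embeddings for the prompt's tokens
W_K = np.random.randn(d_model, d_model)  # learned key projection
W_V = np.random.randn(d_model, d_model)  # learned value projection

# One layer's contribution to the KV cache: a key and a value per token.
K = X @ W_K  # shape (3000, 64): what each token is "about"
V = X @ W_V  # shape (3000, 64): what information each token carries

# A real model repeats this for every layer and every attention head,
# then discards it all when the response finishes, unless it is cached.
```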
And in a standard API call, all of that work is discarded once the response is complete.
The next request comes with the same system prompt. The model builds the same catalogue from scratch every single time.
Prompt Caching: Saving the Catalogue
Prompt caching happens when the provider doesn't throw the catalogue away.
When you send a request, the provider checks whether the beginning of your prompt matches something it has already processed. If it finds a match, it reuses the stored key-value pairs for that matching prefix instead of recomputing them.
The model skips straight to the part of the prompt that is new, typically the user's actual question, and computes fresh KV pairs only for that.
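In application-level pseudocode, the idea looks something like this (a toy sketch: real providers match token prefixes inside the serving stack, not Python strings, and `compute_kv` is a stand-in for the expensive attention computation):

```python
def compute_kv(text: str) -> list[str]:
    """Stand-in for the expensive per-token key/value computation."""
    return [f"kv({tok})" for tok in text.split()]

kv_store: dict[str, list[str]] = {}  # cached prefix -> its stored KV states

def process(prompt: str, static_prefix: str) -> list[str]:
    if prompt.startswith(static_prefix):
        if static_prefix not in kv_store:           # first request: full price
            kv_store[static_prefix] = compute_kv(static_prefix)
        cached = kv_store[static_prefix]            # later requests: reused
        fresh = compute_kv(prompt[len(static_prefix):])  # only the new suffix
        return cached + fresh
    return compute_kv(prompt)                       # no matching prefix: no savings
```

The result is dramatic.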
Anthropic's Claude offers cached input tokens at 90% off the standard price.
OpenAI matches this discount on their latest models. Google's Gemini takes a different approach, charging based on storage duration, but the savings are comparable.
In practice, teams report cost reductions of 50-70% on their API bills after implementing prompt caching properly.
The latency improvements are equally important. Skipping the computation for thousands of tokens means the model can start generating its response much faster.
OpenAI reports up to 80% latency reduction on cached requests. For user-facing applications where response time directly affects experience, this is substantial.
The Confusion: Prompt Caching vs KV Cache
There are two different things called "caching" in the LLM world.
They work at different layers. The KV cache is an optimisation that happens automatically during a single inference request.
When the model generates text token by token, it caches the key-value pairs from previous tokens so it doesn't recompute them at each step.
It is internal to the model. It is always on. Prompt caching is a feature offered by API providers that saves the KV states across separate requests.
When you send a new API call that starts with the same prefix as a previous call, the provider reuses the cached computation from the earlier request.
That is what saves you money. The KV cache works within a single request.
Prompt caching works across requests. One is an engineering optimisation you never see. The other is a cost lever you need to design for.
The Golden Rule: Static First, Dynamic Last
Here is where prompt caching stops being a feature you turn on and becomes a discipline you practice. Prompt caching works on prefixes.
The provider matches your prompt from the beginning.
The moment your prompt diverges from what was cached, the cache stops helping. Everything after the divergence point is computed fresh.
This means the structure of your prompt directly determines how much money you save.
The golden rule is simple: put the parts that never change at the top, and put the parts that change with every request at the bottom.

A well-structured prompt for caching runs: system instructions first, then examples, then reference documents, then the user's conversation history, and finally the user's current message.
The system instructions and examples are identical across every request. They hit the cache every time. The user's message is unique. It is computed fresh.
Everything in between falls on a spectrum. A poor prompt scatters dynamic content throughout: a user ID in the system prompt, a timestamp in the instructions, a session-specific variable in the middle of your few-shot examples. Any of these breaks the prefix match and forces the model to recompute everything that follows.
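Here is the same request assembled both ways (the Acme prompt text and helper names are illustrative; the message shape follows the common chat format):

```python
import datetime

def cache_friendly(history: list[dict], user_msg: str) -> list[dict]:
    # Static content first: byte-identical on every request, so the
    # provider's prefix match succeeds and these tokens come from cache.
    system = (
        "You are a support assistant for Acme.\n"
        "Policies: ...\n"
        "Examples:\nQ: ...\nA: ...\n"
    )
    # Dynamic content last: only these tokens are computed fresh.
    return [{"role": "system", "content": system},
            *history,
            {"role": "user", "content": user_msg}]

def cache_hostile(history: list[dict], user_msg: str) -> list[dict]:
    # A timestamp in the very first tokens means the prefix never matches,
    # so nothing downstream of it is ever served from cache.
    system = (f"Current time: {datetime.datetime.now().isoformat()}\n"
              "You are a support assistant for Acme. ...")
    return [{"role": "system", "content": system},
            *history,
            {"role": "user", "content": user_msg}]
```

The cache-hostile version pays full price on every request, even though almost none of the prompt actually changed.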
ProjectDiscovery, an open-source security company, documented this exactly. Their first implementation of prompt caching cut LLM costs by 59%.
After restructuring their prompts to maximise the static prefix, moving all variable content to the end, savings climbed to 70%.
How Each Provider Handles It
The three major providers all offer prompt caching, but with different trade-offs.

Anthropic (Claude) requires you to explicitly mark which parts of your prompt should be cached using cache_control breakpoints on content blocks in the request body. Cache entries live for 5 minutes by default, with a 1-hour option available at a higher write cost.
Cached reads cost 10% of the standard input price, a 90% discount. The manual approach gives precise control but requires more implementation work.
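A minimal sketch with the Anthropic Python SDK (the model name and prompt text are placeholders):

```python
import anthropic

LONG_SYSTEM_PROMPT = "You are a support assistant for Acme. ..."  # ~3,000 static tokens in practice

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        # Everything up to this breakpoint is cached (5 minutes by default).
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Where is my order?"}],
)

# The usage object reports whether you wrote to or read from the cache.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```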
OpenAI handles caching automatically. If your prompt is 1,024 tokens or longer and its beginning matches a recent request, the prefix is served from the cache with no code changes.
Cache hits are matched in 128-token increments. The discount is 50% on most models and up to 90% on the newest GPT-5 family. The simplicity is the selling point.
Caching happens in the background without any manual effort.
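With the OpenAI Python SDK, the only thing worth adding to existing code is a check of the cached-token count in the response (prompt and model name are placeholders):

```python
from openai import OpenAI

LONG_SYSTEM_PROMPT = "You are a support assistant for Acme. ..."  # 1,024+ tokens in practice

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # static prefix
        {"role": "user", "content": "Where is my order?"},  # dynamic suffix
    ],
)

# No opt-in needed; this reports how many input tokens came from the cache.
print(response.usage.prompt_tokens_details.cached_tokens)
```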
Google (Gemini) takes a storage-based approach. You explicitly create a cached context with a configurable lifetime, from minutes to days or weeks.
You pay for token storage over time. It suits workloads where a large document or codebase needs to persist across many sessions over an extended period.
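A sketch with the google-genai Python SDK (the model name, file, and TTL are placeholders; verify the config class names against your SDK version):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

big_document = open("codebase_digest.txt").read()  # hypothetical long-lived context

# Store the static context once, with an explicit lifetime you pay for.
cache = client.caches.create(
    model="gemini-2.0-flash-001",
    config=types.CreateCachedContentConfig(
        system_instruction="You answer questions about the attached codebase.",
        contents=[big_document],
        ttl="86400s",  # keep for a day; billed per token stored per hour
    ),
)

# Later requests reference the cache by name instead of resending the context.
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="Where is authentication handled?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```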
The strategic choice depends on your workload. Anthropic suits high-volume batch processing where you control prompt structure precisely.
OpenAI fits best for interactive applications where simplicity matters.
Google targets scenarios with long-lived context.
When Caching Does Not Help
Prompt caching is not a universal fix. It helps most when your prompts have a large, stable prefix that repeats across many requests. It helps the least in three situations.
If every request is unique, with different system prompts, context, and examples, there is nothing to cache. It is rare in production but common in exploratory or ad-hoc usage.
If your prompts are short, the savings are negligible. OpenAI requires at least 1,024 tokens before caching activates. If your prompt is 500 tokens, there is nothing to gain.
If your cache window expires between requests, you pay cache-write costs without ever getting cache reads. Anthropic's default 5-minute window means low-traffic applications may never hit the cache: a product handling three requests per hour empties the cache between each one.
The teams seeing the biggest savings share a pattern: high volume, long prompts, stable prefixes. The ideal case is a customer support bot handling hundreds of queries per hour with the same system prompt and knowledge base.
The worst case is a developer testing prompts in a notebook.
The Cost Equation You Should Run
Before implementing prompt caching, run this calculation.
Take your average prompt length. Identify how many tokens are static (system prompt, examples, reference docs) versus dynamic (user message, conversation history).
Multiply the static portion by your daily request volume.
That is the computation you are currently paying for and throwing away.
If 70% of your prompt is static and you send 10,000 requests per day, you are recomputing the same 70% of work ten thousand times.
With caching, you compute it once and reuse it 9,999 times at 10% of the cost.
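A quick back-of-the-envelope version in Python, assuming a 4,000-token average prompt and $3 per million input tokens with cached reads at 10% of that (swap in your own numbers):

```python
prompt_tokens = 4_000          # average prompt length (assumed)
static_fraction = 0.70         # share of the prompt that never changes
requests_per_day = 10_000
price_per_mtok = 3.00          # assumed $ per million input tokens
cached_read_price = 0.10       # cached reads at 10% of the standard price

static_tokens_per_day = prompt_tokens * static_fraction * requests_per_day

without_cache = static_tokens_per_day / 1e6 * price_per_mtok
with_cache = static_tokens_per_day / 1e6 * price_per_mtok * cached_read_price

print(f"static tokens/day:     {static_tokens_per_day:,.0f}")
print(f"cost without caching:  ${without_cache:.2f}/day")
print(f"cost with caching:     ${with_cache:.2f}/day")  # ignores cache-write premium
```

On these assumed numbers, that is $84 a day falling to under $9, before accounting for cache-write premiums and imperfect hit rates.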
The engineering effort is not zero. You need to restructure prompts, choose a provider's caching mechanism, and monitor cache hit rates.
But for any product operating at scale, the maths favours caching. The question is not whether to implement it, but how much money you are burning by not having done so.
Prompt caching is one of those rare optimisations where the upside is enormous, and the downside is close to zero.
The only requirement is that you structure your prompts with caching in mind from the start, because retrofitting it later means rewriting every prompt in your system.
