If your application sends the same long preamble over and over, you are paying to process those identical tokens again on every single call. Prompt caching is the technique that fixes this, and it is one of the highest-leverage cost optimizations available to an LLM developer.
This article explains what caching does, where the savings come from, and how to structure prompts on Model Database to take advantage of it.
The repeated-context problem
Most production prompts have two parts: a large stable section and a small variable section. The stable part might be a system prompt, a style guide, few-shot examples, or a chunk of reference documentation. The variable part is the user's actual question.
Imagine a support assistant whose system prompt is 3,000 tokens of policies and examples. If the user message is only 50 tokens, then about 98% of your input (3,000 of 3,050 tokens) is identical from one request to the next. Without caching, you process all 3,050 tokens every time.
What caching actually saves
Caching lets the provider reuse the already-processed representation of that stable prefix instead of recomputing it. The practical effect: repeated input tokens are billed at a reduced rate compared with fresh input tokens. You still pay full price for the variable suffix and for output tokens, but the big static block becomes much cheaper after the first call.
Illustrative math: suppose you make 50,000 calls a day that each share a 3,000-token prefix. That is 150 million repeated input tokens daily. Even a partial discount on that volume is a meaningful line item, and it costs you nothing but a small change in how you order your messages.
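The arithmetic above can be sketched in a few lines of Python. The price per million tokens and the cached-token discount below are placeholder values chosen for illustration, not Model Database rates, and the sketch ignores the first uncached call; substitute the figures from the pricing page.

```python
def daily_cache_savings(calls_per_day, prefix_tokens, price_per_mtok_usd, cached_discount):
    """Estimate daily savings on a shared prefix.

    price_per_mtok_usd and cached_discount are hypothetical inputs:
    use your provider's published rates.
    """
    repeated_tokens = calls_per_day * prefix_tokens
    # Cost of the repeated prefix if every token were billed as fresh input.
    full_cost = repeated_tokens / 1_000_000 * price_per_mtok_usd
    # Portion of that cost removed by the cached-token discount.
    savings = full_cost * cached_discount
    return repeated_tokens, full_cost, savings


# 50,000 calls/day sharing a 3,000-token prefix, with assumed rates.
repeated, full, saved = daily_cache_savings(
    50_000, 3_000, price_per_mtok_usd=3.00, cached_discount=0.5
)
print(repeated)  # 150000000 repeated input tokens per day
print(full, saved)
```

Changing `cached_discount` shows how sensitive the daily figure is to the actual cached rate.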
Structure prompts for cache hits
Caching works on prefixes, so the golden rule is: put the stable content first and the variable content last. Anything that changes between requests should live at the end of the prompt.
{
  "model": "anthropic/claude-sonnet-4-6",
  "messages": [
    {"role": "system", "content": "<large stable policy + examples>"},
    {"role": "user", "content": "<short variable question>"}
  ]
}
If you interleave a timestamp, a request ID, or a random token early in the prompt, the prefix changes on every request and nothing after that point can be reused, so you lose the hit. Keep volatile values out of the cached region.
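One way to enforce that ordering is to build every request through a single helper. This is a minimal sketch: `build_request` and the `[request_id: ...]` suffix format are hypothetical names invented for illustration, and the placeholder strings stand in for your real prompt.

```python
STABLE_SYSTEM_PROMPT = "<large stable policy + examples>"


def build_request(question, request_id=None, model="anthropic/claude-sonnet-4-6"):
    """Return a request body with stable content first and volatile content last."""
    user_content = question
    if request_id is not None:
        # Volatile values are appended after the question, never placed
        # in or before the stable system prompt.
        user_content = f"{question}\n\n[request_id: {request_id}]"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": STABLE_SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
    }


a = build_request("How do I reset my password?", request_id="req-001")
b = build_request("What is your refund policy?", request_id="req-002")

# The leading message is identical across requests; only the tail differs.
print(a["messages"][0] == b["messages"][0])  # True
```

Keeping the stable prompt in one constant also makes accidental edits to the cached region easy to spot in code review.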
Confirm the savings with the charge headers
You do not have to take caching on faith. Model Database returns the exact cost of each call in the X-MDB-Charged-USD response header, so you can send the same request twice and watch the second one come in cheaper.
curl -sD - https://modeldatabase.com/v1/chat/completions \
  -H "Authorization: Bearer mdb_live_..." \
  -H "Content-Type: application/json" \
  -d @request.json | grep -i x-mdb-charged
# first call: X-MDB-Charged-USD: 0.0041
# second call: X-MDB-Charged-USD: 0.0017
Log X-MDB-Charged-USD across a sample of traffic and you can measure what caching actually saves you in dollars, rather than guessing at a hit rate.
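If you capture raw response headers in your logs, a small parser can turn them into a savings figure. This is a sketch under assumptions: the helper names are hypothetical, the sample header text reuses the illustrative values from the curl example, and treating the first charge as the uncached baseline is a simplification that only holds when the requests are otherwise identical.

```python
import re

# Matches a header line such as "X-MDB-Charged-USD: 0.0041", case-insensitively.
HEADER_RE = re.compile(
    r"^x-mdb-charged-usd:\s*([0-9]*\.?[0-9]+)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_charged_usd(raw_headers):
    """Extract the charged amount from raw header text, or None if absent."""
    match = HEADER_RE.search(raw_headers)
    return float(match.group(1)) if match else None


def savings_vs_baseline(charges):
    """Assume charges[0] is the uncached cost of each identical request."""
    would_have_paid = charges[0] * len(charges)
    return would_have_paid - sum(charges)


first_call = "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nX-MDB-Charged-USD: 0.0041\r\n\r\n"
second_call = "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nx-mdb-charged-usd: 0.0017\r\n\r\n"

charges = [parse_charged_usd(first_call), parse_charged_usd(second_call)]
saved = savings_vs_baseline(charges)
print(charges, round(saved, 4))
```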
Design patterns that maximize hits
- Stable system prompts: finalize your instructions and few-shot examples so they do not churn between deploys.
- Document Q&A: place the document once at the top and append each user question at the bottom, so a series of questions about the same document all hit the cache.
- Batch by shared context: group requests that share a prefix together in time, since caches are most effective when hits arrive close together.
- Avoid early personalization: if you must inject user-specific data, see whether it can move toward the end of the prompt.
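The document Q&A and batching patterns can be combined by grouping pending requests on a hash of their shared prefix before dispatching them. The sketch below is one possible approach, not a Model Database feature; `prefix_key`, `group_by_prefix`, and the placeholder documents are hypothetical.

```python
import hashlib
from collections import defaultdict


def prefix_key(prefix):
    """Stable identifier for a shared prefix (document, system prompt, etc.)."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def group_by_prefix(requests):
    """Group (prefix, question) pairs so requests sharing a prefix run back to back.

    Groups are returned in order of first appearance; questions keep their
    arrival order within each group.
    """
    groups = defaultdict(list)
    for prefix, question in requests:
        groups[prefix_key(prefix)].append((prefix, question))
    return list(groups.values())


pending = [
    ("<document A>", "Who signed it?"),
    ("<document B>", "What is the term?"),
    ("<document A>", "When does it expire?"),
]

batches = group_by_prefix(pending)
for batch in batches:
    document = batch[0][0]
    questions = [question for _, question in batch]
    # Each batch would be sent with the document first and one question last.
    print(document, questions)
```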
When caching does not help
Caching is worthless if every request is unique. A one-shot creative writing tool with no shared system prompt has nothing to reuse. The technique shines specifically when a large block of context repeats, so spend your effort on the endpoints where that is true and measure the rest with the charge headers.
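To decide which endpoints are worth the effort, you can estimate how much of a typical prompt is shared across a sample of requests. The sketch below measures the common prefix in characters rather than tokens, which is only a rough proxy, and `shared_prefix_share` is a hypothetical helper name.

```python
import os


def shared_prefix_share(prompts):
    """Fraction of the average prompt length covered by the common prefix.

    Lengths are in characters, an approximation of token counts.
    """
    if not prompts:
        return 0.0
    # os.path.commonprefix compares strings character by character.
    common = os.path.commonprefix(prompts)
    average_length = sum(len(p) for p in prompts) / len(prompts)
    return len(common) / average_length if average_length else 0.0


policy = "POLICY " * 100  # stand-in for a long stable system prompt
support_prompts = [
    policy + "How do I reset my password?",
    policy + "Where is my order?",
]
creative_prompts = [
    "Write a poem about rain",
    "Tell a story about a fox",
]

support_share = shared_prefix_share(support_prompts)
creative_share = shared_prefix_share(creative_prompts)
print(round(support_share, 2), creative_share)
```

A share near 1.0 suggests an endpoint where caching should pay off; a share near 0.0 suggests there is nothing to reuse.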
Want to see caching pay off in real dollars? Send a couple of repeated requests and compare the headers, then track the trend on your dashboard. Model and rate details are on the pricing page.