Source Code
The runnable TypeScript source for this lesson is in `lessons/08-cache-hitting/`.
Lesson 08: Prompt Caching - Reducing LLM API Costs¶
Overview¶
Prompt caching is a technique that dramatically reduces API costs for multi-turn conversations and repeated requests. Instead of processing the same tokens over and over, providers can cache and reuse processed prefixes.
Why Caching Matters¶
In an agent loop, you resend the entire conversation history on every turn:
Turn 1: [System prompt] + [User message] -> ~2,000 tokens
Turn 2: [System prompt] + [User] + [Response] -> ~2,500 tokens
Turn 3: [System prompt] + [User] + [Response] x 2 -> ~3,000 tokens
Turn 10: [System prompt] + [Full history] -> ~7,000 tokens
Without caching, you're paying for the system prompt (~2,000 tokens) ten times.
Cost Comparison¶
| Scenario | Without Cache | With Cache | Savings |
|---|---|---|---|
| 10-turn conversation | ~35,000 tokens | ~15,000 effective | 57% |
| 50-turn conversation | ~250,000 tokens | ~75,000 effective | 70% |
How Prompt Caching Works¶
The Mechanism¶
- Cache Write: First request with a cache marker incurs a small write fee
- Cache Hit: Subsequent requests with the same prefix get ~90% discount
- TTL: Cached content expires after ~5 minutes of inactivity
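The savings numbers above can be sketched with a rough cost model. This is a minimal illustration, not provider pricing: it assumes a 2,000-token static prefix, ~500 new tokens per turn, cached tokens billed at ~10% of the normal input rate (matching the ~90% discount), and it ignores the one-time cache write fee.

```typescript
// Illustrative assumptions: 2,000-token static prefix, ~500 new tokens
// per turn, and cache hits billed at ~10% of the normal input rate.
const PREFIX_TOKENS = 2_000;
const TOKENS_PER_TURN = 500;
const CACHED_RATE = 0.1;

function totalInputTokens(turns: number): number {
  // Without caching: every turn resends the prefix plus all history so far.
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    total += PREFIX_TOKENS + turn * TOKENS_PER_TURN;
  }
  return total;
}

function effectiveTokensWithCache(turns: number): number {
  // With caching: the prefix and prior history are cache hits billed at the
  // discounted rate; only this turn's new tokens are full price.
  let effective = 0;
  for (let turn = 1; turn <= turns; turn++) {
    const cached = turn === 1 ? 0 : PREFIX_TOKENS + (turn - 1) * TOKENS_PER_TURN;
    const fresh = PREFIX_TOKENS + turn * TOKENS_PER_TURN - cached;
    effective += fresh + cached * CACHED_RATE;
  }
  return effective;
}

console.log(totalInputTokens(10));         // billed tokens without caching
console.log(effectiveTokensWithCache(10)); // billed-equivalent tokens with caching
```

Plugging in longer conversations shows why the savings percentage grows with turn count: the cached share of each request keeps increasing while the fresh share stays roughly constant.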
What Gets Cached¶
```typescript
// Mark content for caching with cache_control
const message = {
  role: 'system',
  content: [
    {
      type: 'text',
      text: 'You are a helpful coding assistant...',
      cache_control: { type: 'ephemeral' } // <- this marker
    }
  ]
};
```
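In practice you rarely build these blocks by hand. A small helper like the hypothetical `cacheableSystem` below (the name and types are illustrative, not part of any SDK) wraps a plain string prompt in the block form with the marker attached:

```typescript
// Block shape following the Anthropic-style cache_control convention.
type TextBlock = {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
};

// Hypothetical helper: wrap a plain string system prompt in block form
// with a cache marker attached.
function cacheableSystem(text: string): { role: 'system'; content: TextBlock[] } {
  return {
    role: 'system',
    content: [{ type: 'text', text, cache_control: { type: 'ephemeral' } }],
  };
}

const sys = cacheableSystem('You are a helpful coding assistant...');
```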
Supported Models (via OpenRouter)¶
| Provider | Models | Cache TTL |
|---|---|---|
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Haiku | 5 minutes |
| Google | Gemini 2.0 Flash, Gemini 1.5 Pro | 5 minutes |
| DeepSeek | All models | Varies |
What to Cache vs. What Not to Cache¶
Good Caching Candidates¶
- System prompts - Static instructions, don't change between turns
- Tool definitions - Same tools available throughout conversation
- Large context - Documentation, file contents provided upfront
- Conversation history - Old messages that won't change
Poor Caching Candidates¶
- Latest user message - Changes every turn
- Small content - Overhead exceeds benefit (<1,000 tokens)
- Dynamic content - Timestamps, changing state
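These rules can be condensed into a simple "worth caching?" check. The 4-characters-per-token estimate and the 1,000-token floor are rough heuristics for illustration, not provider guarantees:

```typescript
const MIN_CACHE_TOKENS = 1_000; // below this, overhead exceeds benefit

// Crude heuristic: roughly 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Cache only content that is both static and large enough to pay off.
function shouldCache(text: string, isStatic: boolean): boolean {
  return isStatic && estimateTokens(text) >= MIN_CACHE_TOKENS;
}
```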
Cache Breakpoints¶
A critical concept: caching works on prefixes. If you insert content in the middle, the cache breaks.
Request 1: [A] [B] [C] <- Cache stores "ABC"
Request 2: [A] [B] [C] [D] <- Cache HIT on "ABC", only D is new
Request 3: [A] [X] [B] [C] <- Cache MISS - X breaks the prefix!
Best Practice: Put static content at the start, dynamic content at the end.
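The prefix rule above can be modeled as a simple block-by-block comparison: the cache covers everything up to the first position where the new request diverges from the cached one.

```typescript
// Toy model of prefix caching: count how many leading blocks of the new
// request match the previously cached request. Everything after the first
// mismatch must be reprocessed at full price.
function cachedPrefixLength(previous: string[], next: string[]): number {
  let i = 0;
  while (i < previous.length && i < next.length && previous[i] === next[i]) {
    i++;
  }
  return i;
}

cachedPrefixLength(['A', 'B', 'C'], ['A', 'B', 'C', 'D']); // hit on A, B, C
cachedPrefixLength(['A', 'B', 'C'], ['A', 'X', 'B', 'C']); // only A survives
```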
Architecture Pattern¶
+-------------------------------------------------------------+
| CacheAwareProvider |
| |
| +-----------------------------------------------------+ |
| | Automatic Cache Marker Injection | |
| | - System prompts -> always cached | |
| | - Tool definitions -> always cached | |
| | - User context > threshold -> cached | |
| +-----------------------------------------------------+ |
| | |
| v |
| +-----------------------------------------------------+ |
| | Underlying Provider (OpenRouter) | |
| | - Passes cache_control to API | |
| | - Tracks cached_tokens in response | |
| +-----------------------------------------------------+ |
| | |
| v |
| +-----------------------------------------------------+ |
| | Statistics Tracking | |
| | - Cache hits vs misses | |
| | - Estimated cost savings | |
| | - Tokens cached per request | |
| +-----------------------------------------------------+ |
+-------------------------------------------------------------+
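The layering above can be sketched as a thin wrapper. The `Provider` interface, block shapes, and stats fields here are assumptions for illustration; the lesson's `cache-provider.ts` is the actual implementation.

```typescript
interface Block { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } }
interface Message { role: 'system' | 'user' | 'assistant'; content: Block[] }
interface Response { text: string; cachedTokens: number; inputTokens: number }
interface Provider { send(messages: Message[]): Promise<Response> }

class CacheAwareProvider implements Provider {
  stats = { hits: 0, misses: 0, cachedTokens: 0 };

  constructor(private inner: Provider) {}

  async send(messages: Message[]): Promise<Response> {
    // Automatic marker injection: mark the last block of each system message.
    for (const m of messages) {
      if (m.role === 'system' && m.content.length > 0) {
        m.content[m.content.length - 1].cache_control = { type: 'ephemeral' };
      }
    }
    const res = await this.inner.send(messages);
    // Statistics tracking: classify hit/miss from the reported cached tokens.
    if (res.cachedTokens > 0) this.stats.hits++;
    else this.stats.misses++;
    this.stats.cachedTokens += res.cachedTokens;
    return res;
  }
}
```

Because the wrapper implements the same `Provider` interface it consumes, it can be dropped in front of any underlying provider without changing calling code.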
Files in This Lesson¶
| File | Purpose |
|---|---|
| `cache-basics.ts` | Simple examples demonstrating cache markers |
| `cache-provider.ts` | Cache-aware provider wrapper |
| `cost-calculator.ts` | Estimate savings from caching |
| `examples/basic-caching.ts` | Runnable demo of basic caching |
| `examples/system-prompt-cache.ts` | System prompt caching pattern |
| `examples/multi-turn-cache.ts` | Multi-turn conversation caching |
Key Takeaways¶
- Cache static content: System prompts, tool definitions, large context
- Order matters: Static content first, dynamic content last
- Minimum size: Don't cache content under ~1,000 tokens
- Track savings: Monitor cache hits to verify benefit
- TTL awareness: Keep conversations active to maintain cache
Next Steps¶
After understanding caching, move to Lesson 09, where we integrate caching with:
- Persistent context management
- Multi-agent architecture
- Session-based conversations