# Prompt Caching Cut Our AI Costs by 70%
We were bleeding money on AI API costs. Then we discovered prompt caching.
## The Problem
Our AI customer service bot was handling 100K conversations/month. Each conversation:
- 4KB system prompt
- 2KB few-shot examples
- 1KB user context
- Variable user messages
Cost: $50,000/month 💸
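To see where the money went, it helps to put rough numbers on it. Here's a back-of-envelope sketch, assuming the common ~4 characters/token rule of thumb and a hypothetical average user message size (not a real billing figure):

```typescript
// Rough token model for one request. All constants are illustrative
// assumptions, not exact billing numbers.
const CHARS_PER_TOKEN = 4; // common rule of thumb for English text
const kbToTokens = (kb: number) => (kb * 1024) / CHARS_PER_TOKEN;

const staticTokens = kbToTokens(4) + kbToTokens(2) + kbToTokens(1); // 1,792
const userTokens = kbToTokens(0.8); // hypothetical avg user message size
const totalTokens = staticTokens + userTokens;

console.log(`static share: ${((staticTokens / totalTokens) * 100).toFixed(0)}%`);
// => static share: 90% (under these assumptions)
```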
## The Insight
Roughly 90% of the tokens in every request were static:
- System prompt: Same for everyone
- Few-shot examples: Rarely change
- User context: Same within a session
We were paying to send the same tokens over and over.
## Solution: Prompt Caching
### Anthropic's Cache
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: systemPrompt, // 4KB of instructions
      cache_control: { type: "ephemeral" }, // 👈 Cache this
    },
    {
      type: "text",
      text: fewShotExamples, // 2KB of examples
      cache_control: { type: "ephemeral" }, // 👈 Cache this too
    },
  ],
  messages: [{ role: "user", content: userMessage }],
});

// Cache reads are billed at ~10% of the normal input rate: a 90% cost
// reduction on cached tokens. Cache writes cost ~25% extra, and the
// ephemeral cache expires after ~5 minutes of inactivity, so this pays
// off for prompts that are reused frequently.
```
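You can verify you're actually hitting the cache: the Messages API reports cache activity on the response's `usage` object.

```typescript
// Log cache effectiveness per request using the usage fields the
// Messages API returns when prompt caching is active.
const { cache_creation_input_tokens, cache_read_input_tokens, input_tokens } =
  response.usage;

console.log(
  `cache wrote ${cache_creation_input_tokens ?? 0} tokens, ` +
    `read ${cache_read_input_tokens ?? 0} tokens, ` +
    `paid full price for ${input_tokens} tokens`
);
```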
### OpenAI's Approach
```typescript
// OpenAI caches identical prompt prefixes automatically (no opt-in) once
// the prompt passes the minimum length of ~1024 tokens. Cached input
// tokens are billed at a discount. Just structure prompts consistently:
const messages = [
  { role: "system", content: STATIC_SYSTEM_PROMPT },
  { role: "user", content: STATIC_CONTEXT },
  { role: "user", content: dynamicUserInput }, // Only this varies
];
// Keep static content first; prefix matching stops at the first token
// that differs between requests.
```
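As with Anthropic, the response tells you whether the cache kicked in: chat completions report cached tokens under `usage.prompt_tokens_details`. A sketch, assuming an initialized `openai` client and a caching-capable model like gpt-4o:

```typescript
const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
});

// cached_tokens counts the part of the prompt served from OpenAI's cache
const cachedTokens = completion.usage?.prompt_tokens_details?.cached_tokens ?? 0;
const promptTokens = completion.usage?.prompt_tokens ?? 0;
const ratio = promptTokens ? (cachedTokens / promptTokens) * 100 : 0;
console.log(`cache hit ratio: ${ratio.toFixed(1)}%`);
```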
### Self-Hosted with Semantic Cache
```typescript
import { SemanticCache } from "ai-cache";

const cache = new SemanticCache({
  redis: redisClient,
  embedding: openai.embeddings,
  threshold: 0.95, // minimum similarity to count as a hit
});

async function query(prompt: string) {
  // Check the semantic cache: a sufficiently similar past prompt
  // can be answered without touching the API at all
  const cached = await cache.get(prompt);
  if (cached) return cached;

  // Miss: call the API and cache the result for next time
  const response = await llm.complete(prompt);
  await cache.set(prompt, response);
  return response;
}
```
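If you'd rather not take on a dependency, the core mechanism is small. A minimal in-memory sketch, assuming a hypothetical `embed()` helper that returns a unit-normalized embedding vector (e.g. via an embeddings API) and an `llm` client; a production version would persist vectors in Redis or a vector store:

```typescript
type Entry = { vector: number[]; response: string };
const entries: Entry[] = [];

// Cosine similarity; with unit-normalized vectors it's just a dot product
const dot = (a: number[], b: number[]) =>
  a.reduce((sum, v, i) => sum + v * b[i], 0);

async function cachedComplete(prompt: string): Promise<string> {
  const vector = await embed(prompt); // assumed helper: text -> unit vector

  // Linear scan is fine for small caches; use a vector index at scale
  const best = entries.reduce<{ score: number; entry?: Entry }>(
    (acc, entry) => {
      const score = dot(vector, entry.vector);
      return score > acc.score ? { score, entry } : acc;
    },
    { score: 0 }
  );

  if (best.entry && best.score >= 0.95) return best.entry.response; // hit

  const response = await llm.complete(prompt); // assumed LLM client
  entries.push({ vector, response });
  return response;
}
```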
## Advanced Strategies
### 1. Prompt Deduplication
```typescript
// Before: redundant context in every message
messages.push({ role: "user", content: `Context: ${ctx}\n\nQuestion: ${q1}` });
messages.push({ role: "user", content: `Context: ${ctx}\n\nQuestion: ${q2}` });

// After: context sent once, questions reference it
messages.push({ role: "system", content: `Context: ${ctx}` });
messages.push({ role: "user", content: q1 });
messages.push({ role: "user", content: q2 });
```
### 2. Response Caching
```typescript
import { createHash } from "crypto";

// Cache entire responses for common queries. Key on everything that
// affects the output, not just the prompt text.
const cacheKey = createHash("sha256").update(`${model}:${temperature}:${prompt}`).digest("hex");
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);

const response = await llm.complete(prompt);
await redis.set(cacheKey, JSON.stringify(response), "EX", 3600); // ioredis-style 1h TTL
return response;
```
### 3. Tiered Models
```typescript
// Use cheaper models for simple queries
const complexity = await classifier.analyze(query);
const model = complexity > 0.7 ? "gpt-4" : "gpt-3.5-turbo";
```
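The `classifier` above is doing the interesting work. Here's a deliberately crude heuristic sketch of what it might look like; a real router could use a small, cheap model as the judge instead:

```typescript
// Toy complexity score in [0, 1]: long queries and "reasoning" words
// skew complex. Purely illustrative; tune against your own traffic.
function analyze(query: string): number {
  const reasoningWords = ["why", "compare", "explain", "debug", "refactor"];
  const lengthScore = Math.min(query.length / 500, 1); // longer = harder
  const wordScore = reasoningWords.some((w) => query.toLowerCase().includes(w))
    ? 0.5
    : 0;
  return Math.min(lengthScore * 0.6 + wordScore, 1);
}

const model = analyze(query) > 0.7 ? "gpt-4" : "gpt-3.5-turbo";
```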
## Results
| Metric | Before | After |
|--------|--------|-------|
| Monthly cost | $50,000 | $15,000 |
| Avg latency | 2.1s | 0.8s |
| Cache hit rate | 0% | 73% |
## Key Takeaways
- **Audit your prompts**: find the static parts
- **Structure for caching**: put static content first
- **Monitor hit rates**: optimize based on data
- **Combine strategies**: caching + model routing + response cache (sketched below)
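Putting the pieces together, here's a hedged sketch of how a single request could flow through all three layers. The `respondWithCachedPrompt` call stands in for the provider-level prompt caching shown earlier; the names are illustrative:

```typescript
async function handle(query: string): Promise<string> {
  // 1. Response cache: skip the API entirely on a hit
  const cached = await cache.get(query);
  if (cached) return cached;

  // 2. Model routing: don't pay frontier-model prices for easy questions
  const model = analyze(query) > 0.7 ? "gpt-4" : "gpt-3.5-turbo";

  // 3. Prompt caching: static prefix first, so the provider caches it
  const response = await respondWithCachedPrompt(model, query);

  await cache.set(query, response);
  return response;
}
```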
The best AI cost optimization doesn't sacrifice quality. It eliminates waste.