Last Tuesday, a founder showed me their Claude API usage dashboard. They were burning $47K/month. The problem wasn’t their RAG pipeline or their vector database. It was a 1,847 token system prompt running with every single API call.
The Problem
Here’s what I found in their codebase:
SYSTEM_PROMPT = """
You are an AI assistant for [Company]. You help users with...
[1,847 tokens of instructions, examples, and guidelines]
"""
def get_response(user_query):
response = anthropic.messages.create(
model="claude-4-opus",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_query}
]
)
return response
Here are the details that made their eyes twitch:
Daily requests: 50,000
System prompt: 1,847 tokens
User query average: 53 tokens
Response average: 215 tokens
Daily system prompt tokens: 92,350,000
Monthly (30 days): 2,770,500,000 tokens = 2,770.5 million
At Claude Opus pricing ($15/million input tokens):
Monthly cost just for system prompt: $41,557.50
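The arithmetic is easy to verify; here is the same back-of-the-envelope calculation as a few lines of Python, using only the numbers above:

# Back-of-the-envelope: cost of the static system prompt alone
DAILY_REQUESTS = 50_000
SYSTEM_PROMPT_TOKENS = 1_847
OPUS_INPUT_PRICE_PER_M = 15.00  # $ per million input tokens

daily_prompt_tokens = DAILY_REQUESTS * SYSTEM_PROMPT_TOKENS  # 92,350,000
monthly_prompt_tokens = daily_prompt_tokens * 30             # 2,770,500,000
monthly_cost = monthly_prompt_tokens / 1_000_000 * OPUS_INPUT_PRICE_PER_M
print(f"Monthly system-prompt cost: ${monthly_cost:,.2f}")   # $41,557.50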
The system prompt was identical for 94% of their traffic. They were literally paying Anthropic to read the same instructions 47,000 times per day.
Why This Happens
The mental model is wrong. Engineers think of system prompts as configuration files: set them once, attach them to every request, and behavior stays consistent.
But here’s what’s actually happening in production:
User asks: “What’s our refund policy?”
System reads: 1,847 tokens of instructions, including how to write code, format emails, and handle complex calculations
System answers: “Our refund policy is 30 days…”
Cost: $0.093 for what should cost $0.004
The transformer doesn’t need to re-read your entire employee handbook to answer a simple question. After processing the same instructions 10,000 times, the model isn’t learning anything new; you’re just paying for redundant attention computations.
The Fix: Dynamic Context Injection
Stop sending static prompts. Start sending relevant context:
class DynamicPromptRouter:
    def __init__(self):
        self.base_prompt = "You are a helpful AI assistant."  # 7 tokens

        # Context modules (load only what's needed)
        self.contexts = {
            "code": "Format code with syntax highlighting...",  # 187 tokens
            "refunds": "Company refund policy: 30 days...",  # 93 tokens
            "technical": "Use precise technical language...",  # 156 tokens
            "sales": "Focus on value proposition...",  # 211 tokens
        }

    def get_context(self, user_query):
        # Keyword routing is enough to start; swap in a fast intent
        # classifier (with cached results for similar queries) as traffic grows
        query = user_query.lower()
        relevant_contexts = []

        if "code" in query or "function" in query:
            relevant_contexts.append(self.contexts["code"])
        if "refund" in query or "return" in query:
            relevant_contexts.append(self.contexts["refunds"])

        # Only include what's needed
        return " ".join([self.base_prompt] + relevant_contexts)
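Wiring the router into the request path is then a one-line change per call site. A minimal sketch, reusing the Anthropic client from above (the max_tokens value is illustrative):

router = DynamicPromptRouter()

def get_response(user_query):
    # Build the smallest system prompt that covers this query
    system_prompt = router.get_context(user_query)
    response = client.messages.create(
        model="claude-4-opus",
        max_tokens=1024,
        system=system_prompt,  # ~7-290 tokens instead of 1,847
        messages=[{"role": "user", "content": user_query}],
    )
    return response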
You can cache prompt-response pairs at the infrastructure level:
import time
import numpy as np

class PromptCache:
    def __init__(self, ttl=3600):
        self.cache = []  # (embedding, entry, timestamp) tuples; use Redis in production
        self.ttl = ttl

    def add(self, embedding, prompt, response_template):
        entry = {"prompt": prompt, "response_template": response_template}
        self.cache.append((np.asarray(embedding), entry, time.time()))

    def get_cached_context(self, user_query_embedding):
        query = np.asarray(user_query_embedding)
        # Find similar, non-expired recent queries
        for emb, entry, created_at in self.cache:
            if time.time() - created_at > self.ttl:
                continue
            sim = np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb))
            if sim > 0.95:  # nearly identical query
                return entry["prompt"], entry["response_template"]
        return None, None
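To show where the cache sits, here is a sketch of the request path with it in front of the API call. The embedding model choice (a small sentence-transformers model) and the decision to reuse the cached response wholesale on a hit are my assumptions, not part of the original setup:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any cheap embedder works
cache = PromptCache(ttl=3600)

def get_response_cached(user_query):
    embedding = embedder.encode(user_query)
    prompt, cached_response = cache.get_cached_context(embedding)
    if cached_response is not None:
        return cached_response  # near-duplicate query: skip the API call entirely

    # Cache miss: build the minimal prompt and call the API as usual
    prompt = router.get_context(user_query)
    response = client.messages.create(
        model="claude-4-opus",
        max_tokens=1024,
        system=prompt,
        messages=[{"role": "user", "content": user_query}],
    )
    cache.add(embedding, prompt, response.content[0].text)
    return response.content[0].text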
The Results
I implemented this for three companies last month (all using Claude Opus at $15/million input tokens):
| Company | Daily Requests | Before | After | Reduction | Monthly Savings |
| --- | --- | --- | --- | --- | --- |
| SaaS Startup (Series A) | 50,000 | 2,134 tokens/req | 267 tokens/req | 87.5% | $42,007 |
| E-commerce Platform | 40,000 | 1,887 tokens/req | 412 tokens/req | 78.2% | $26,550 |
| Dev Tools Company | 35,000 | 2,455 tokens/req | 189 tokens/req | 92.3% | $35,689 |
The e-commerce platform had the most interesting pattern. Their system prompt included instructions for handling 47 different types of customer queries. But their actual traffic showed 72% were product questions needing 3 instruction blocks, 18% were order status needing 2 blocks, 7% were refunds needing 2 blocks, and only 3% needed full context. They were sending 47 instruction blocks for queries that needed 2-3.
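Translated into code, that analysis becomes a routing table. The intent names and block identifiers below are illustrative placeholders; the traffic shares are the ones from their logs:

# Intent -> instruction blocks, built from the traffic analysis above
INTENT_ROUTES = {
    # intent: (share of traffic, instruction blocks actually needed)
    "product_question": (0.72, ["product_catalog", "tone", "formatting"]),  # 3 blocks
    "order_status": (0.18, ["order_lookup", "tone"]),  # 2 blocks
    "refund": (0.07, ["refund_policy", "tone"]),  # 2 blocks
    "other": (0.03, None),  # fall back to the full 47-block prompt
}

def blocks_for(intent):
    _share, blocks = INTENT_ROUTES.get(intent, INTENT_ROUTES["other"])
    return blocks  # None means "send everything"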
Advanced Optimization: Prompt Compilation
Here’s where it gets technically interesting. You can “compile” prompts into embeddings and reuse computation:
class CompiledPrompt:
    def __init__(self):
        # Pre-compute embeddings for common contexts
        self.compiled_contexts = {}

    def compile_context(self, context_name, prompt_text):
        # This is pseudocode - the actual implementation depends on the model
        # But the concept is valid: pre-compute attention patterns
        embedding = model.encode(prompt_text)
        attention_pattern = model.compute_attention(prompt_text)
        self.compiled_contexts[context_name] = {
            "embedding": embedding,
            "attention": attention_pattern,
            "token_count": len(tokenizer.encode(prompt_text)),
        }

    def get_compiled(self, required_contexts):
        # Return pre-computed representations
        # Model can skip redundant attention computation
        return [self.compiled_contexts[ctx] for ctx in required_contexts]
Uncomfortable Parts
Your 2,000 token system prompt is a security blanket. You include every possible instruction because you’re afraid the model will misbehave without them. But here’s what actually happens:
Token 1-500: Core behavior instructions (useful)
Token 501-1000: Edge cases that happen 1% of the time
Token 1001-1500: Examples for scenarios that never occur
Token 1501-2000: Defensive instructions that duplicate earlier ones
The model’s attention mechanism starts ignoring redundant instructions after ~500 tokens anyway. You’re paying for tokens the model isn’t even processing meaningfully.
Implementation Checklist
Week 1: Audit
# Run this on your logs
def audit_prompt_usage(api_logs, price_per_m_tokens=15.0):
    prompt_frequency = {}
    for request in api_logs:
        prompt = request["system_prompt"]
        prompt_frequency[prompt] = prompt_frequency.get(prompt, 0) + 1

    # Every repeat after a prompt's first occurrence is redundant spend.
    # ~4 chars/token is a rough, dependency-free estimate; use your tokenizer for exact numbers.
    redundant_cost = sum(
        (count - 1) * (len(prompt) / 4) / 1_000_000 * price_per_m_tokens
        for prompt, count in prompt_frequency.items()
    )

    print(f"Unique prompts: {len(prompt_frequency)}")
    print(f"Most common prompt runs: {max(prompt_frequency.values())} times")
    print(f"Redundant token cost: ${redundant_cost:,.2f}")
Week 2: Classify and Route - Build intent classifier (can use a tiny model like DistilBERT), map intents to minimal required context, start with top 80% of traffic patterns.
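If you'd rather not fine-tune DistilBERT on day one, a nearest-example classifier over embeddings gets you surprisingly far. This sketch uses sentence-transformers and made-up example queries, both my assumptions:

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# A few labeled examples per intent; grow this from your top 80% of traffic
INTENT_EXAMPLES = {
    "code": ["write a function to parse CSV", "why does this code throw an error"],
    "refunds": ["how do I get a refund", "can I return my order"],
    "sales": ["what plans do you offer", "how much does the pro tier cost"],
}
intent_names = list(INTENT_EXAMPLES)
intent_embeddings = [embedder.encode(examples) for examples in INTENT_EXAMPLES.values()]

def classify_intent(user_query, threshold=0.5):
    query_emb = embedder.encode(user_query)
    scores = [float(util.cos_sim(query_emb, embs).max()) for embs in intent_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return intent_names[best] if scores[best] >= threshold else "general"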
Week 3: Cache and Optimize - Implement embedding cache for similar queries, add prompt compilation for frequent contexts, monitor latency impact (should improve!).
Week 4: Production Rollout - A/B test response quality, track cost reduction, document edge cases for future context modules.
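For the rollout itself, a deterministic hash on the user id keeps each user in one arm across requests. A minimal sketch; the 10% bucket size and the log_request call are placeholders for whatever experimentation tooling you already have:

import hashlib

def prompt_variant(user_id, rollout_pct=10):
    # Same user always lands in the same arm
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "dynamic" if bucket < rollout_pct else "static"

def get_response_ab(user_id, user_query):
    variant = prompt_variant(user_id)
    system_prompt = router.get_context(user_query) if variant == "dynamic" else SYSTEM_PROMPT
    response = client.messages.create(
        model="claude-4-opus",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_query}],
    )
    # Track cost per arm; response quality still needs human or LLM review
    log_request(user_id, variant, response.usage.input_tokens)  # hypothetical logger
    return response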
The Bottom Line
Every time you send a static 2,000 token system prompt with a 50 token user query, you’re paying 40x more than necessary for redundant context. The fix isn’t complicated; it’s a mindset shift from “global configuration” to “dynamic context.”
One founder reduced their Claude API bill from $52K to $8K per month with four days of engineering work. They didn’t change models. They didn’t reduce functionality. They just stopped sending the same instructions 50,000 times per day.
Your system prompt is not a constitution. It’s a context buffer. Use it wisely.