Cost Optimization
Bill above target? Work through these five practical changes, ordered by impact. Start with #1.
Many settings below can be handled by just telling Rose — no terminal required. For example, “Switch Ada’s default model to GPT-4o-mini.” Manual commands are provided for users who prefer them.
1. Switch Model (Biggest Impact)
Savings: 50%–95%
Simplest change. One setting, over 50% cost reduction.
When a Cheap Model Fits
| Task | Recommended |
|---|---|
| Customer FAQ replies | GPT-4o-mini / Claude Haiku / Gemini Flash |
| Translation, summarization | GPT-4o-mini |
| Chit-chat, companionship | GPT-4o-mini |
| Simple classification, tagging | Gemini Flash |
| Bulk processing | Gemini Flash / Groq |
When an Expensive Model Is Worth It
| Task | Recommended |
|---|---|
| Code generation and debugging | Claude 3.5 Sonnet |
| Complex reasoning, analysis | GPT-4o / o1 |
| Long-context (100K+) | Gemini 1.5 Pro |
| Creative writing | Claude 3.5 Sonnet |
Setup
Tell Rose:
“Switch Ada’s default model to GPT-4o-mini.”
Or manually: Admin Panel → Ada Dashboard → Settings → Model → pick a new model.
Use a tiered strategy (see the sketch after this list):
- Default to the cheap model
- Escalate to the expensive model only for specific tasks
- OpenClaw and Hermes both support dynamic model switching mid-workflow
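As a sketch of what that escalation can look like in application code — the keyword heuristic, model names, and complete() helper below are illustrative assumptions, not OpenClaw's routing API:
# Tiered routing sketch: cheap model by default, expensive model on demand.
CHEAP_MODEL = "gpt-4o-mini"
EXPENSIVE_MODEL = "gpt-4o"

# Deliberately naive heuristic; replace with a classifier or explicit flags.
ESCALATION_KEYWORDS = ("debug", "refactor", "analyze", "prove", "step by step")

def pick_model(message: str) -> str:
    """Send a message to the expensive model only when it looks hard."""
    lowered = message.lower()
    if any(kw in lowered for kw in ESCALATION_KEYWORDS):
        return EXPENSIVE_MODEL
    return CHEAP_MODEL

def complete(model: str, prompt: str) -> str:
    raise NotImplementedError("stand-in for your provider SDK call")

def answer(message: str) -> str:
    return complete(pick_model(message), message)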
2. Enable Context Compression
Savings: 40%–70%
The problem: every turn resends the full conversation history, so cost per turn keeps climbing as the conversation grows. By turn 50 you might be carrying 50,000 tokens of history.
The fix: auto-summarize older turns.
Mechanics
Without compression:
Turn 100 = full history (50,000 tokens) + current turn (500 tokens) = 50,500 tokens
With compression:
Turns 1–90 summarized to 2,000 tokens
Most recent 10 turns kept verbatim: 5,000 tokens
Current turn: 500 tokens
Total: 7,500 tokens
A 6.7× reduction: long conversations save over 80% on input tokens per turn.
Setup
Tell Rose:
“Enable context compression for Ada — summarize anything older than 20 turns.”
Or manually:
openclaw config set memory.contextCompression.enabled true
openclaw config set memory.contextCompression.triggerAfterTurns 20
openclaw config set memory.contextCompression.keepRecentTurns 10
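To make the mechanics concrete, here is a toy Python version of the same idea; summarize() is a stand-in for a cheap-model summarization call, and none of this is OpenClaw's actual code:
# Toy version of context compression: one summary blob plus a verbatim tail.
TRIGGER_AFTER_TURNS = 20
KEEP_RECENT_TURNS = 10

def summarize(turns: list[str]) -> str:
    # Stand-in: in practice this is itself a cheap-model call.
    return f"[summary of {len(turns)} earlier turns]"

def build_context(turns: list[str]) -> list[str]:
    """Return the message history actually sent to the model."""
    if len(turns) <= TRIGGER_AFTER_TURNS:
        return turns  # short conversation: send everything verbatim
    old, recent = turns[:-KEEP_RECENT_TURNS], turns[-KEEP_RECENT_TURNS:]
    return [summarize(old)] + recent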
Trade-off
After compression, old details blur. Important facts (company name, customer ID, order number) should live in the memory system (memory-wiki), not conversation history.
3. Cap Reply Length
Savings: 30%–60%
AI defaults to fully answering every question — sometimes writing a page when a paragraph suffices.
Setup
Tell Rose:
“Cap Ada’s replies at 300 tokens unless I explicitly ask for a long answer.”
Or manually:
openclaw config set model.maxTokens 300
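For context, this is where such a cap lands at the API level. The snippet uses the OpenAI Python SDK purely for illustration; it is not how OpenClaw applies the setting internally:
# Illustrative only: the same cap expressed as a direct OpenAI SDK call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=300,  # hard ceiling on output tokens; longer answers are truncated
    messages=[{"role": "user", "content": "How do returns work?"}],
)
print(resp.choices[0].message.content)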
Suggested Caps
| Use case | maxTokens |
|---|---|
| Support FAQs | 200–400 |
| General chat | 500 |
| Email / copywriting | 1,000–2,000 |
| Long-form analysis | 4,000+ |
Advanced: Add Rules to the System Prompt
In Ada’s AGENTS.md (Personality) add:
## Response Style Rules
- Unless explicitly asked, keep replies under 5 sentences
- Prefer bullet points for long explanations (saves tokens)
- Do not repeat the question as a preamble
These rules cost tokens themselves (each request carries them), but save more in output.
4. Enable Response Caching
Savings: 60%–90% (only for repeated questions)
Many support flows have tons of repeat questions: “How do returns work?”, “What are your hours?”, “What’s the price?”
Recomputing the answer from scratch each time is wasted spend.
Two Caching Modes
Mode A: Prompt Caching (native on OpenAI / Anthropic)
openclaw config set model.enablePromptCaching true
The static input portion (system prompt, agent personality, tool definitions) gets cached; on a cache hit, that portion is billed at a deep discount (cache reads cost roughly 10% of the normal input price on Anthropic, and about half price on OpenAI).
Mode B: Response Caching (full response cache)
Cache full responses for frequent questions; subsequent identical questions hit the cache directly.
openclaw config set responseCaching.enabled true
openclaw config set responseCaching.ttlSeconds 3600
openclaw config set responseCaching.minSimilarity 0.95
minSimilarity 0.95 = only serve cached responses when the new question is ≥95% similar to the cached one.
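A minimal sketch of similarity-gated response caching, assuming a hypothetical embed() hook for a real embedding API; this shows the shape of the idea, not OpenClaw's implementation:
# Serve a stored answer when a new question is near-identical to a cached one.
import math
import time

TTL_SECONDS = 3600
MIN_SIMILARITY = 0.95
_cache: list[tuple[list[float], str, float]] = []  # (embedding, answer, stored_at)

def embed(text: str) -> list[float]:
    raise NotImplementedError("plug in a real embedding API here")

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cached_answer(question: str) -> str | None:
    """Return a stored answer if a near-identical question is still fresh."""
    q = embed(question)
    now = time.time()
    for emb, answer, stored_at in _cache:
        if now - stored_at < TTL_SECONDS and _cosine(q, emb) >= MIN_SIMILARITY:
            return answer  # cache hit: zero model cost
    return None

def store_answer(question: str, answer: str) -> None:
    _cache.append((embed(question), answer, time.time()))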
5. Division of Labor (Architectural)
Savings: 30%–80%, situational
Use all three companions; concentrate expensive work on the minority of cases.
Common Architecture
Customer LINE message
↓
[Ada — GPT-4o-mini] quick triage
├─ Standard FAQ → replies directly (cheap)
├─ Complex question → hands off to Rose (GPT-4o)
└─ Data lookup → hands off to Vi (MCP tool)
Effect:
- 80% of messages go through GPT-4o-mini ($0.15/M input tokens)
- 15% escalate to GPT-4o ($2.50/M input tokens)
- 5% handled by Vi (moderate cost)
Weighted average cost is far below an all-GPT-4o architecture.
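The blended rate works out like this (input-token prices only; Vi's per-million figure is an assumed placeholder):
# Weighted average input price per million tokens for the traffic split above.
mix = [
    (0.80, 0.15),  # GPT-4o-mini: $0.15/M input
    (0.15, 2.50),  # GPT-4o: $2.50/M input
    (0.05, 1.00),  # Vi via MCP tools: assumed $1.00/M placeholder
]
blended = sum(share * price for share, price in mix)
print(f"${blended:.3f}/M blended vs $2.50/M all-GPT-4o")  # ≈ $0.545/M, ~78% cheaper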
Setup
See Multi-Agent Workflow for full configuration examples.
Secondary Optimizations (Smaller, Additive)
Disable Unused Tools
Every turn sends tool definitions as input tokens. More tools = bigger input.
# List enabled tools
openclaw tool list
# Disable what you don't use
openclaw tool disable web-search
openclaw tool disable image-generation
Rate Limiting
# Max 100 API calls per hour
openclaw config set rateLimit.requestsPerHour 100
# Each user capped at 50 per day
openclaw config set rateLimit.perUserPerDay 50
Prevents abuse or bot attacks from burning cash.
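If you run your own gateway in front of the agent, a per-user daily cap is only a few lines. This is a generic Python sketch, not OpenClaw internals (the CLI settings above already do this for you):
# Minimal per-user daily cap; counters key on (user, calendar day).
from collections import defaultdict
from datetime import date

PER_USER_PER_DAY = 50
_counts: dict[tuple[str, date], int] = defaultdict(int)

def allow_request(user_id: str) -> bool:
    """Admit a request only while the user is under today's cap."""
    key = (user_id, date.today())
    if _counts[key] >= PER_USER_PER_DAY:
        return False  # over budget: reject or queue
    _counts[key] += 1
    return True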
Disable Streaming (Situational)
Streaming (token-by-token) can cost extra on some providers. If the UX does not need it, disable:
openclaw config set channels.*.streaming "off"
Worked Example: Going From $800 to $150
This is an illustrative scenario. Actual savings depend on your conversation patterns, volume, and model mix. No specific numbers are guaranteed.
Starting point:
- E-commerce support, 200 conversations a day
- Entirely on GPT-4o
- No context compression
- No response caps
- $800 / month
Optimization steps:
- Switch model — default to GPT-4o-mini, escalate only on complex questions: $800 → $280
- Context compression — summarize after 15 turns: $280 → $200
- Response cap 400 tokens — $200 → $170
- FAQ prompt caching — $170 → $150
Four changes, 81% savings.
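As a quick sanity check on the arithmetic, using the illustrative figures from the scenario above:
# Cumulative effect of the four steps on the illustrative $800/month bill.
baseline = 800
steps = [
    ("switch model", 280),
    ("context compression", 200),
    ("response cap", 170),
    ("prompt caching", 150),
]
for name, bill in steps:
    print(f"after {name}: ${bill}/month ({1 - bill / baseline:.0%} saved)")
# final line: after prompt caching: $150/month (81% saved)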
Related Docs
- Usage Dashboard Reference
- How Tokens Are Priced
- Budget Alerts
- Engine Selection Guide — engine choice also affects cost
- Multi-Agent Workflow — division-of-labor details