Cost Optimization
Bill above target? Work through these five practical changes, ordered by impact. Start with #1.
Many settings below can be handled by just telling Rose — no terminal required. For example, “Switch Ada’s default model to GPT-4o-mini.” Manual commands are provided for users who prefer them.
1. Switch Model (Biggest Impact)
Savings: 50%–95%
Simplest change. One setting, over 50% cost reduction.
When a Cheap Model Fits
| Task | Recommended |
|---|---|
| Customer FAQ replies | GPT-4o-mini / Claude Haiku / Gemini Flash |
| Translation, summarization | GPT-4o-mini |
| Chit-chat, companionship | GPT-4o-mini |
| Simple classification, tagging | Gemini Flash |
| Bulk processing | Gemini Flash / Groq |
When an Expensive Model Is Worth It
| Task | Recommended |
|---|---|
| Code generation and debugging | Claude 3.5 Sonnet |
| Complex reasoning, analysis | GPT-4o / o1 |
| Long-context (100K+) | Gemini 1.5 Pro |
| Creative writing | Claude 3.5 Sonnet |
Setup
Tell Rose:
“Switch Ada’s default model to GPT-4o-mini.”
Or manually: Admin Panel → Ada Dashboard → Settings → Model → pick a new model.
Use a tiered strategy (see the sketch after this list):
- Default to the cheap model
- Escalate to the expensive model only for specific tasks
- OpenClaw and Hermes both support dynamic model switching mid-workflow
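As a sketch of what that escalation can look like in application code — the keyword heuristic, model names, and complete() helper below are illustrative assumptions, not OpenClaw's routing API:
# Tiered routing sketch: cheap model by default, expensive model on demand.
CHEAP_MODEL = "gpt-4o-mini"
EXPENSIVE_MODEL = "gpt-4o"

# Deliberately naive heuristic; replace with a classifier or explicit flags.
ESCALATION_KEYWORDS = ("debug", "refactor", "analyze", "prove", "step by step")

def pick_model(message: str) -> str:
    """Send a message to the expensive model only when it looks hard."""
    lowered = message.lower()
    if any(kw in lowered for kw in ESCALATION_KEYWORDS):
        return EXPENSIVE_MODEL
    return CHEAP_MODEL

def complete(model: str, prompt: str) -> str:
    raise NotImplementedError("stand-in for your provider SDK call")

def answer(message: str) -> str:
    return complete(pick_model(message), message)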
2. Enable Context Compression
Savings: 40%–70%
The problem: every turn resends the full conversation history, so cost per turn keeps climbing as the conversation grows. By turn 50 you might be carrying 50,000 tokens of history.
The fix: auto-summarize older turns.
Mechanics
Without compression:
Turn 100 = full history (50,000 tokens) + current turn (500 tokens) = 50,500 tokens
With compression:
Turns 1–90 summarized to 2,000 tokens
Most recent 10 turns kept verbatim: 5,000 tokens
Current turn: 500 tokens
Total: 7,500 tokens
A 6.7× reduction: long conversations save over 80% on input tokens per turn.
Setup
Tell Rose:
“Enable context compression for Ada — summarize anything older than 20 turns.”
Or manually:
openclaw config set memory.contextCompression.enabled true
openclaw config set memory.contextCompression.triggerAfterTurns 20
openclaw config set memory.contextCompression.keepRecentTurns 10
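To make the mechanics concrete, here is a toy Python version of the same idea; summarize() is a stand-in for a cheap-model summarization call, and none of this is OpenClaw's actual code:
# Toy version of context compression: one summary blob plus a verbatim tail.
TRIGGER_AFTER_TURNS = 20
KEEP_RECENT_TURNS = 10

def summarize(turns: list[str]) -> str:
    # Stand-in: in practice this is itself a cheap-model call.
    return f"[summary of {len(turns)} earlier turns]"

def build_context(turns: list[str]) -> list[str]:
    """Return the message history actually sent to the model."""
    if len(turns) <= TRIGGER_AFTER_TURNS:
        return turns  # short conversation: send everything verbatim
    old, recent = turns[:-KEEP_RECENT_TURNS], turns[-KEEP_RECENT_TURNS:]
    return [summarize(old)] + recent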
Trade-off
After compression, old details blur. Important facts (company name, customer ID, order number) should live in the memory system (memory-wiki), not conversation history.
3. Cap Reply Length
Savings: 30%–60%
AI defaults to fully answering every question — sometimes writing a page when a paragraph suffices.
Setup
Tell Rose:
“Cap Ada’s replies at 300 tokens unless I explicitly ask for a long answer.”
Or manually:
openclaw config set model.maxTokens 300
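For context, this is where such a cap lands at the API level. The snippet uses the OpenAI Python SDK purely for illustration; it is not how OpenClaw applies the setting internally:
# Illustrative only: the same cap expressed as a direct OpenAI SDK call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=300,  # hard ceiling on output tokens; longer answers are truncated
    messages=[{"role": "user", "content": "How do returns work?"}],
)
print(resp.choices[0].message.content)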
Suggested Caps
| Use case | maxTokens |
|---|---|
| Support FAQs | 200–400 |
| General chat | 500 |
| Email / copywriting | 1,000–2,000 |
| Long-form analysis | 4,000+ |
Advanced: Add Rules to the System Prompt
In Ada’s AGENTS.md (Personality) add:
## Response Style Rules
- Unless explicitly asked, keep replies under 5 sentences
- Prefer bullet points for long explanations (saves tokens)
- Do not repeat the question as a preamble
These rules cost tokens themselves (each request carries them), but save more in output.
4. Enable Response Caching
Savings: 60%–90% (only for repeated questions)
Many support flows have tons of repeat questions: “How do returns work?”, “What are your hours?”, “What’s the price?”
Recomputing the answer from scratch each time is wasted spend.
Two Caching Modes
Mode A: Prompt Caching (native on OpenAI / Anthropic)
openclaw config set model.enablePromptCaching true
The static input portion (system prompt, agent personality, tool definitions) gets cached; on a cache hit, that portion is billed at a deep discount (cache reads cost roughly 10% of the normal input price on Anthropic, and about half price on OpenAI).
Mode B: Response Caching (full response cache)
Cache full responses for frequent questions; subsequent identical questions hit the cache directly.
openclaw config set responseCaching.enabled true
openclaw config set responseCaching.ttlSeconds 3600
openclaw config set responseCaching.minSimilarity 0.95
minSimilarity 0.95 = only serve cached responses when the new question is ≥95% similar to the cached one.
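A minimal sketch of similarity-gated response caching, assuming a hypothetical embed() hook for a real embedding API; this shows the shape of the idea, not OpenClaw's implementation:
# Serve a stored answer when a new question is near-identical to a cached one.
import math
import time

TTL_SECONDS = 3600
MIN_SIMILARITY = 0.95
_cache: list[tuple[list[float], str, float]] = []  # (embedding, answer, stored_at)

def embed(text: str) -> list[float]:
    raise NotImplementedError("plug in a real embedding API here")

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cached_answer(question: str) -> str | None:
    """Return a stored answer if a near-identical question is still fresh."""
    q = embed(question)
    now = time.time()
    for emb, answer, stored_at in _cache:
        if now - stored_at < TTL_SECONDS and _cosine(q, emb) >= MIN_SIMILARITY:
            return answer  # cache hit: zero model cost
    return None

def store_answer(question: str, answer: str) -> None:
    _cache.append((embed(question), answer, time.time()))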
5. Division of Labor (Architectural)
Savings: 30%–80%, situational
Use all three companions; concentrate expensive work on the minority of cases.
Common Architecture
Customer LINE message
↓
[Ada — GPT-4o-mini] quick triage
├─ Standard FAQ → replies directly (cheap)
├─ Complex question → hands off to Rose (GPT-4o)
└─ Data lookup → hands off to Vi (MCP tool)
Effect:
- 80% of messages go through GPT-4o-mini ($0.15/M input tokens)
- 15% escalate to GPT-4o ($2.50/M input tokens)
- 5% handled by Vi (moderate cost)
Weighted average cost is far below an all-GPT-4o architecture.
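The blended rate works out like this (input-token prices only; Vi's per-million figure is an assumed placeholder):
# Weighted average input price per million tokens for the traffic split above.
mix = [
    (0.80, 0.15),  # GPT-4o-mini: $0.15/M input
    (0.15, 2.50),  # GPT-4o: $2.50/M input
    (0.05, 1.00),  # Vi via MCP tools: assumed $1.00/M placeholder
]
blended = sum(share * price for share, price in mix)
print(f"${blended:.3f}/M blended vs $2.50/M all-GPT-4o")  # ≈ $0.545/M, ~78% cheaper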
Setup
See Multi-Agent Workflow for full configuration examples.
Secondary Optimizations (Smaller, Additive)
Disable Unused Tools
Every turn sends tool definitions as input tokens. More tools = bigger input.
# List enabled tools
openclaw tool list
# Disable what you don't use
openclaw tool disable web-search
openclaw tool disable image-generation
Rate Limiting
# Max 100 API calls per hour
openclaw config set rateLimit.requestsPerHour 100
# Each user capped at 50 per day
openclaw config set rateLimit.perUserPerDay 50
Prevents abuse or bot attacks from burning cash.
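If you run your own gateway in front of the agent, a per-user daily cap is only a few lines. This is a generic Python sketch, not OpenClaw internals (the CLI settings above already do this for you):
# Minimal per-user daily cap; counters key on (user, calendar day).
from collections import defaultdict
from datetime import date

PER_USER_PER_DAY = 50
_counts: dict[tuple[str, date], int] = defaultdict(int)

def allow_request(user_id: str) -> bool:
    """Admit a request only while the user is under today's cap."""
    key = (user_id, date.today())
    if _counts[key] >= PER_USER_PER_DAY:
        return False  # over budget: reject or queue
    _counts[key] += 1
    return True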
Disable Streaming (Situational)
Streaming (token-by-token) can cost extra on some providers. If the UX does not need it, disable:
openclaw config set channels.*.streaming "off"
Worked Example: Going From $800 to $150
This is an illustrative scenario. Actual savings depend on your conversation patterns, volume, and model mix. No specific numbers are guaranteed.
Starting point:
- E-commerce support, 200 conversations a day
- Entirely on GPT-4o
- No context compression
- No response caps
- $800 / month
Optimization steps:
- Switch model — default to GPT-4o-mini, escalate only on complex questions: $800 → $280
- Context compression — summarize after 15 turns: $280 → $200
- Response cap 400 tokens — $200 → $170
- FAQ prompt caching — $170 → $150
Four changes, 81% savings.
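As a quick sanity check on the arithmetic, using the illustrative figures from the scenario above:
# Cumulative effect of the four steps on the illustrative $800/month bill.
baseline = 800
steps = [
    ("switch model", 280),
    ("context compression", 200),
    ("response cap", 170),
    ("prompt caching", 150),
]
for name, bill in steps:
    print(f"after {name}: ${bill}/month ({1 - bill / baseline:.0%} saved)")
# final line: after prompt caching: $150/month (81% saved)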
Related Docs
- Usage Dashboard Reference
- How Tokens Are Priced
- Budget Alerts
- Engine Selection Guide — engine choice also affects cost
- Multi-Agent Workflow — division-of-labor details