
Cost Optimization

Bill above target? Work through these five practical changes, ordered by impact. Start with #1.

Many settings below can be handled by just telling Rose — no terminal required. For example, “Switch Ada’s default model to GPT-4o-mini.” Manual commands are provided for users who prefer them.


1. Switch Model (Biggest Impact)

Savings: 50%–95%

Simplest change. One setting, over 50% cost reduction.

When a Cheap Model Fits

Task                            Recommended
Customer FAQ replies            GPT-4o-mini / Claude Haiku / Gemini Flash
Translation, summarization      GPT-4o-mini
Chit-chat, companionship        GPT-4o-mini
Simple classification, tagging  Gemini Flash
Bulk processing                 Gemini Flash / Groq

When an Expensive Model Is Worth It

Task                           Recommended
Code generation and debugging  Claude 3.5 Sonnet
Complex reasoning, analysis    GPT-4o / o1
Long context (100K+ tokens)    Gemini 1.5 Pro
Creative writing               Claude 3.5 Sonnet

Setup

Tell Rose:

“Switch Ada’s default model to GPT-4o-mini.”

Or manually: Admin Panel → Ada Dashboard → Settings → Model → pick a new model.

Use a tiered strategy (a minimal routing sketch follows the list):

  • Default to the cheap model
  • Escalate to the expensive model only for specific tasks
  • OpenClaw and Hermes both support dynamic model switching mid-workflow
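Here is the tiered idea as a minimal Python sketch. The task labels and the classification step are illustrative assumptions, not OpenClaw APIs; a real setup would hook into OpenClaw's model switching instead.

# Tiered routing sketch (illustrative; not an OpenClaw API).
# Assumption: an upstream step classifies each message into a task type.

CHEAP_MODEL = "gpt-4o-mini"
EXPENSIVE_MODEL = "gpt-4o"

# Hypothetical task types that justify the expensive model
ESCALATE_TASKS = {"code_generation", "complex_reasoning", "long_context"}

def pick_model(task_type: str) -> str:
    """Default to the cheap model; escalate only for listed tasks."""
    return EXPENSIVE_MODEL if task_type in ESCALATE_TASKS else CHEAP_MODEL

print(pick_model("faq_reply"))        # gpt-4o-mini
print(pick_model("code_generation"))  # gpt-4o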

2. Enable Context Compression

Savings: 40%–70%

The problem: every turn resends the accumulated conversation history, so per-turn cost grows as the conversation gets longer. By turn 100, each request might carry 50,000 tokens of history.

The fix: auto-summarize older turns.

Mechanics

Without compression:
  Turn 100 = full history (50,000 T) + current turn (500 T) = 50,500 T

With compression:
  Turns 1–90 → summarized to 2,000 T
  Keep most recent 10 turns verbatim: 5,000 T
  Current turn: 500 T
  = 7,500 T

6.7× difference — long conversations save over 80%.
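The same arithmetic as a quick Python check of the example numbers above:

# Token counts from the example above
without = 50_000 + 500              # full history + current turn
compressed = 2_000 + 5_000 + 500    # summary + recent 10 turns + current turn

print(without / compressed)         # ~6.7x fewer tokens per turn
print(1 - compressed / without)     # ~0.85 -> roughly 85% saved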

Setup

Tell Rose:

“Enable context compression for Ada — summarize anything older than 20 turns.”

Or manually:

openclaw config set memory.contextCompression.enabled true
openclaw config set memory.contextCompression.triggerAfterTurns 20
openclaw config set memory.contextCompression.keepRecentTurns 10

Trade-off

After compression, old details blur. Important facts (company name, customer ID, order number) should live in the memory system (memory-wiki), not conversation history.
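A toy contrast of the two storage layers (assumption: memory-wiki behaves like a durable key-value store; this is not its real API):

# Summarized history blurs specifics; durable memory keeps them exact.
history_summary = "Customer asked about an order and a refund."   # details blurred
memory = {"customer_id": "C-1042", "order_number": "SO-77831"}    # exact facts kept

print(memory["order_number"])   # still retrievable verbatim after compression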


3. Cap Reply Length

Savings: 30%–60%

AI defaults to fully answering every question — sometimes writing a page when a paragraph suffices.

Setup

Tell Rose:

“Cap Ada’s replies at 300 tokens unless I explicitly ask for a long answer.”

Or manually:

openclaw config set model.maxTokens 300

Suggested Caps

Use case             maxTokens
Support FAQs         200–400
General chat         500
Email / copywriting  1,000–2,000
Long-form analysis   4,000+

Advanced: Add Rules to the System Prompt

In Ada’s AGENTS.md (Personality) add:

## Response Style Rules
- Unless explicitly asked, keep replies under 5 sentences
- Prefer bullet points for long explanations (saves tokens)
- Do not repeat the question as a preamble

These rules cost tokens themselves (each request carries them), but save more in output.
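Why the trade works: output tokens are priced several times higher than input tokens, so a small permanent input overhead pays for itself quickly. A back-of-envelope check, with all token counts and prices as assumptions (GPT-4o-mini list rates; verify current pricing):

# Assumed per-million-token prices for GPT-4o-mini
INPUT_PER_M, OUTPUT_PER_M = 0.15, 0.60

rule_overhead = 40    # extra input tokens the style rules add to every request
output_saved = 300    # output tokens trimmed from a typical reply (assumption)

extra_cost = rule_overhead * INPUT_PER_M / 1e6
saved = output_saved * OUTPUT_PER_M / 1e6
print(saved / extra_cost)   # ~30x: the savings dwarf the overhead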


4. Enable Response Caching

Savings: 60%–90% (only for repeated questions)

Many support flows have tons of repeat questions: “How do returns work?”, “What are your hours?”, “What’s the price?”

Each time the AI recomputes = wasted spend.

Two Caching Modes

Mode A: Prompt Caching (native on OpenAI / Anthropic)

openclaw config set model.enablePromptCaching true

The stable input portion (system prompt, agent personality, tool definitions) gets cached; on a cache hit, that portion is billed at a steep discount (Anthropic cache reads cost roughly 10% of the normal input rate; OpenAI discounts cached input by about 50%).

Mode B: Response Caching (full response cache)

Cache full responses for frequent questions; subsequent identical questions hit the cache directly.

openclaw config set responseCaching.enabled true
openclaw config set responseCaching.ttlSeconds 3600
openclaw config set responseCaching.minSimilarity 0.95

minSimilarity 0.95 = only serve cached responses when the new question is ≥95% similar to the cached one.
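A toy version of the similarity gate, using Python's difflib purely for illustration (a production cache would more likely compare embeddings):

from difflib import SequenceMatcher

cache = {"how do returns work?": "Returns are accepted within 30 days..."}

def lookup(question: str, min_similarity: float = 0.95):
    """Serve a cached answer only if the new question is similar enough."""
    q = question.lower().strip()
    for cached_q, answer in cache.items():
        if SequenceMatcher(None, q, cached_q).ratio() >= min_similarity:
            return answer
    return None   # cache miss: call the model, then store the result

print(lookup("How do returns work?"))   # cache hit
print(lookup("What are your hours?"))   # None -> miss, recompute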


5. Division of Labor (Architectural)

Savings: 30%–80%, situational

Use all three companions; concentrate expensive work on the minority of cases.

Common Architecture

Customer LINE message

[Ada — GPT-4o-mini] quick triage
    ├─ Standard FAQ → replies directly (cheap)
    ├─ Complex question → hands off to Rose (GPT-4o)
    └─ Data lookup → hands off to Vi (MCP tool)

Effect:

  • 80% of messages go through GPT-4o-mini ($0.15/M)
  • 15% escalate to GPT-4o ($2.50/M)
  • 5% handled by Vi (moderate cost)

The weighted average cost lands far below an all-GPT-4o architecture; the arithmetic is sketched below.
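Back-of-envelope blended input pricing using the shares above. The per-million rates for the two GPT models come from the list; Vi's effective rate is a stand-in assumption for "moderate cost", and real bills also include output tokens:

# Traffic shares and assumed per-million input prices
mix = [
    (0.80, 0.15),   # GPT-4o-mini
    (0.15, 2.50),   # GPT-4o
    (0.05, 1.00),   # Vi via MCP tool (assumed rate)
]
blended = sum(share * price for share, price in mix)
print(blended)          # ~$0.55 per million input tokens
print(blended / 2.50)   # ~22% of an all-GPT-4o setup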

Setup

See Multi-Agent Workflow for full configuration examples.


Secondary Optimizations (Smaller, Additive)

Disable Unused Tools

Every turn sends tool definitions as input tokens. More tools = bigger input.

# List enabled tools
openclaw tool list

# Disable what you don't use
openclaw tool disable web-search
openclaw tool disable image-generation
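Rough scale of the overhead, with every number an assumption:

# Assumed sizes and volumes
tokens_per_tool = 150        # input tokens per tool definition
tools, requests = 12, 6_000  # enabled tools, requests per month

overhead = tokens_per_tool * tools * requests   # extra input tokens per month
print(overhead)                 # 10,800,000 tokens
print(overhead * 0.15 / 1e6)    # ~$1.62/month at $0.15 per million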

Rate Limiting

# Max 100 API calls per hour
openclaw config set rateLimit.requestsPerHour 100

# Each user capped at 50 per day
openclaw config set rateLimit.perUserPerDay 50

Prevents abuse or bot attacks from burning cash.
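What the per-user daily cap amounts to, as a toy sketch (not OpenClaw internals):

from collections import defaultdict
from datetime import date

PER_USER_PER_DAY = 50
usage = defaultdict(int)   # (user_id, day) -> request count

def allow(user_id: str) -> bool:
    """Admit a request only while the user is under today's cap."""
    key = (user_id, date.today())
    if usage[key] >= PER_USER_PER_DAY:
        return False
    usage[key] += 1
    return True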

Disable Streaming (Situational)

Streaming (token-by-token) can cost extra on some providers. If the UX does not need it, disable:

openclaw config set channels.*.streaming "off"

Worked Example: Going From $800 to $150

This is an illustrative scenario. Actual savings depend on your conversation patterns, volume, and model mix. No specific numbers are guaranteed.

Starting point:

  • E-commerce support, 200 conversations a day
  • Entirely on GPT-4o
  • No context compression
  • No response caps
  • $800 / month

Optimization steps:

  1. Switch model — default to GPT-4o-mini, escalate only on complex questions: $800 → $280
  2. Context compression — summarize after 15 turns: $280 → $200
  3. Response cap 400 tokens — $200 → $170
  4. FAQ prompt caching — $170 → $150

Four changes, 81% savings.
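The cumulative arithmetic behind those steps, restated in Python:

steps = [
    ("baseline",            800),
    ("model switch",        280),
    ("context compression", 200),
    ("response cap",        170),
    ("prompt caching",      150),
]
for name, monthly in steps:
    saved = 1 - monthly / steps[0][1]
    print(f"{name:20s} ${monthly}/mo  ({saved:.0%} below baseline)")
# last line: prompt caching  $150/mo  (81% below baseline)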