16 min read · By Carlos Aragon

AI Agent Cost Optimization: How I Cut My Claude API Bill by 60%

Running 11 agents in production gets expensive fast. Here's exactly how I cut my Claude API bill by 60% — prompt caching, model tiering, request batching, and prompt compression. Real numbers from my production stack.

The Bill That Made Me Actually Look at the Numbers

February 2026. I opened my Anthropic invoice and stared at it for a long time. Eleven agents running across my stack — BellaBot for call screening, HyrosBot for attribution pulls, DevBot for deploy pipelines, Orus as the orchestration brain, and seven smaller task agents for content, research, and client ops. The total: $847 for the month.

That's not catastrophic. For the automation value delivered, the ROI was still positive. But when I broke down the number by agent and task type, I found something uncomfortable: I was paying Opus prices for work that Haiku could do in its sleep. I was re-sending 4,000-token system prompts on every API call when I could have cached them. I was making 12 sequential API calls for a task that could be batched into one.

In other words: the problem wasn't usage. It was waste.

Over three weeks in February and March, I audited every agent, rewrote the calling patterns, and implemented four specific optimizations. March's bill: $339. Same agents. Same output quality. Same SLA. 60% reduction.

Here's the full breakdown.

The four levers that moved the number:

  • Prompt caching — Stop paying for the same tokens twice (or 200 times)
  • Model tiering — Use Haiku for simple tasks, Sonnet for complex ones, Opus only when you need to
  • Request batching — Async Batch API for non-urgent workloads cuts costs by 50% automatically
  • Prompt compression — Shrink context without shrinking quality

Lever 1: Prompt Caching — The Biggest Single Win

Prompt caching was the highest-leverage change. Before I implemented it, every single API call was sending the full system prompt. Orus's system prompt is 3,800 tokens. BellaBot's is 2,200. HyrosBot's is 1,900. At Claude Sonnet pricing ($3/M input tokens), I was paying for those tokens on every single turn — including the 40+ turns per day where the system prompt never changed.

Anthropic's prompt caching feature lets you mark prompt prefixes with a cache_control block. Once cached, you pay 10% of the normal input token price for cache hits. Cache writes cost 25% more than base input — but the cache lasts 5 minutes (and the TTL resets on each hit), so the math breaks even as soon as the second call hits the cache: 1.25x + 0.10x = 1.35x of base price, versus 2x for two uncached calls. Every turn after that is nearly free.
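To make the break-even concrete, here's a small sketch (using the 1.25x write and 0.10x read multipliers above) that compares cumulative input-token cost for a static prefix with and without caching over n turns:

```typescript
// Cumulative cost of a static prompt prefix over n turns, using
// Anthropic's published multipliers: cache writes cost 1.25x base
// input, cache reads cost 0.10x.
const CACHE_WRITE = 1.25;
const CACHE_READ = 0.10;

function uncachedCost(turns: number, prefixTokens: number, pricePerM: number): number {
  return (turns * prefixTokens * pricePerM) / 1_000_000;
}

function cachedCost(turns: number, prefixTokens: number, pricePerM: number): number {
  if (turns === 0) return 0;
  // Turn 1 writes the cache; turns 2..n read it.
  const effectiveTokens = prefixTokens * (CACHE_WRITE + (turns - 1) * CACHE_READ);
  return (effectiveTokens * pricePerM) / 1_000_000;
}

// 3,800-token prefix at Sonnet's $3/M input price:
for (const turns of [1, 2, 8, 20]) {
  console.log(
    `${turns} turns: uncached $${uncachedCost(turns, 3800, 3).toFixed(4)},` +
    ` cached $${cachedCost(turns, 3800, 3).toFixed(4)}`
  );
}
```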

Implementation

The API change is minimal. You add cache_control to the content block you want cached:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Before: full system prompt sent every call
// After: system prompt cached after first call

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: ORUS_SYSTEM_PROMPT,  // 3,800 tokens
      cache_control: { type: "ephemeral" }  // <-- this is the change
    }
  ],
  messages: conversationHistory
});

// Check cache usage in response
console.log({
  cache_creation_input_tokens: response.usage.cache_creation_input_tokens,
  cache_read_input_tokens: response.usage.cache_read_input_tokens,
  input_tokens: response.usage.input_tokens
});

The first call writes the cache. Every subsequent call within 5 minutes reads from it. In a 20-turn session, turns 2–20 pay 10% instead of 100% for the system prompt. For Orus with a 3,800-token system prompt running at ~50 multi-turn sessions per day:

Orus caching math (monthly):

  • Sessions per day: ~50 × 30 days = 1,500 sessions
  • Avg turns per session: 8 (1 write + 7 reads)
  • System prompt: 3,800 tokens
  • Before: 1,500 × 8 × 3,800 = 45.6M tokens @ $3/M = $136.80
  • After: 1,500 × (3,800 × 1.25 + 3,800 × 7 × 0.10) = ~11.1M effective tokens @ $3/M = $33.35
  • Savings: ~$103/month on system prompts alone
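The same write-plus-reads arithmetic as a sketch, in case you want to plug in your own session counts and prompt sizes:

```typescript
// Monthly system-prompt cost with and without caching, using the
// 1.25x cache-write and 0.10x cache-read multipliers.
interface CachingScenario {
  sessionsPerMonth: number;
  turnsPerSession: number;   // 1 cache write + (n - 1) reads
  promptTokens: number;
  inputPricePerM: number;    // dollars per million input tokens
}

function monthlyCostBefore(s: CachingScenario): number {
  return (s.sessionsPerMonth * s.turnsPerSession * s.promptTokens * s.inputPricePerM) / 1_000_000;
}

function monthlyCostAfter(s: CachingScenario): number {
  const perSession = s.promptTokens * (1.25 + (s.turnsPerSession - 1) * 0.10);
  return (s.sessionsPerMonth * perSession * s.inputPricePerM) / 1_000_000;
}

const orus: CachingScenario = {
  sessionsPerMonth: 1500,
  turnsPerSession: 8,
  promptTokens: 3800,
  inputPricePerM: 3,
};

console.log(monthlyCostBefore(orus).toFixed(2)); // "136.80" without caching
console.log(monthlyCostAfter(orus).toFixed(2));  // about 33.35 with caching
```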

Caching Tool Definitions and Document Context

System prompts aren't the only thing worth caching. If you're injecting a large document, a knowledge base, or a tool list into every call — cache it. I cache HyrosBot's API reference (1,200 tokens), BellaBot's script library (2,100 tokens), and the MOSS API documentation (900 tokens) that I inject when any agent needs to interact with the task queue.

// Cache multiple content blocks independently
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  system: [
    {
      type: "text",
      text: AGENT_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" }
    },
    {
      type: "text",
      text: TOOL_DEFINITIONS,  // API docs, tool list
      cache_control: { type: "ephemeral" }
    },
    {
      type: "text",
      text: KNOWLEDGE_BASE_EXCERPT  // document context
      // No cache_control here — this changes per call
    }
  ],
  messages: conversationHistory
});

// Rule of thumb: cache anything that doesn't change
// between calls within a session. Don't cache the
// dynamic parts (current task context, recent events).

Total caching savings: ~$148/month across all agents. That alone was 17% of the original bill.

Lever 2: Model Tiering — Right-Sizing Every Task

This is where most people leave the most money on the table. The default move when building agent systems is to pick one model and use it everywhere. If you defaulted to Sonnet (the safe middle choice), you're overpaying for simple tasks and potentially underpaying for complex ones.

Here's the pricing spread as of March 2026 (per million tokens, input/output):

Model | Input $/M | Output $/M | Best for
Haiku 4.5 | $0.80 | $4.00 | Classification, extraction, routing, simple QA
Sonnet 4.6 | $3.00 | $15.00 | Complex reasoning, code gen, multi-step tasks
Opus 4.6 | $15.00 | $75.00 | Highest-stakes decisions, nuanced judgment calls

Haiku is 3.75x cheaper than Sonnet on input and 3.75x cheaper on output. Opus is 5x more expensive than Sonnet. The spread is enormous. If you're running any task on Sonnet that Haiku can handle correctly, you're burning money.
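A quick way to feel the spread for a given call shape: a small sketch that prices one request across the three tiers. The prices are hardcoded from the table above, so verify them against current pricing before relying on this.

```typescript
// Per-call cost across model tiers, using the prices from the
// table above (verify against current Anthropic pricing).
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  "haiku-4.5": { inputPerM: 0.80, outputPerM: 4.00 },
  "sonnet-4.6": { inputPerM: 3.00, outputPerM: 15.00 },
  "opus-4.6": { inputPerM: 15.00, outputPerM: 75.00 },
};

function callCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICING[model];
  return (inputTokens * p.inputPerM + outputTokens * p.outputPerM) / 1_000_000;
}

// A typical extraction call: 2,000 tokens in, 300 tokens out
for (const model of Object.keys(PRICING)) {
  console.log(`${model}: $${callCost(model, 2000, 300).toFixed(5)}`);
}
```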

My Task Classification Framework

I built a simple routing layer that assigns each request to the right model based on task type. The routing decision itself runs on Haiku (meta-efficient):

// model-router.ts
type TaskComplexity = "simple" | "standard" | "complex";

function classifyTask(taskDescription: string): TaskComplexity {
  const simplePatterns = [
    /extract|parse|classify|format|convert|summarize briefly/i,
    /yes or no|true or false|is this a|categorize/i,
    /get the (name|date|number|price|status)/i
  ];

  const complexPatterns = [
    /architect|design system|evaluate tradeoffs|nuanced/i,
    /write a full|comprehensive analysis|strategic/i,
    /review this code for security|legal implications/i
  ];

  if (simplePatterns.some(p => p.test(taskDescription))) {
    return "simple";
  }
  if (complexPatterns.some(p => p.test(taskDescription))) {
    return "complex";
  }
  return "standard";
}

function selectModel(complexity: TaskComplexity): string {
  const modelMap = {
    simple: "claude-haiku-4-5-20251001",
    standard: "claude-sonnet-4-6",
    complex: "claude-opus-4-6"
  };
  return modelMap[complexity];
}

// Usage
async function runTask(task: string, context: string) {
  const complexity = classifyTask(task);
  const model = selectModel(complexity);

  const response = await client.messages.create({
    model,
    max_tokens: complexity === "simple" ? 512 : 2048,
    messages: [{ role: "user", content: `${task}

Context: ${context}` }]
  });

  return { response, model, complexity };
}

What I Moved to Haiku

After auditing all agent tasks, I moved these to Haiku:

  • BellaBot intent classification — "Is this call a lead, existing client, or spam?" Simple 3-way classification. Haiku nails it.
  • HyrosBot data extraction — Parsing structured JSON responses from Hyros API into formatted summaries. Pure extraction.
  • MOSS status updates — Formatting subtask status for human-readable reports. No reasoning required.
  • Email triage — Classify inbound emails as urgent/normal/low, route to right agent. Pattern matching, not analysis.
  • Log summarization — Compress deploy logs into a 3-sentence summary for DevBot's morning report.
  • Webhook payload parsing — Extract specific fields from Retell call summaries and format for Supabase insertion.

Approximately 62% of my total API calls were doing tasks that Haiku handles correctly. Moving those calls from Sonnet to Haiku ($3/M → $0.80/M input) saved roughly $187/month on input tokens alone, plus disproportionate savings on output (Haiku outputs are often shorter and less verbose than Sonnet's by nature).

When to Use Opus

I use Opus for exactly two things: architectural decisions that affect multiple systems simultaneously, and high-stakes client communications where the wording genuinely matters. Not blog posts. Not code reviews. Not research summaries. Only decisions where a suboptimal answer has a real cost that exceeds the Opus premium.

In practice, that's maybe 3–5 calls per week. Opus is a specialist, not a workhorse.

Lever 3: Batching — 50% Off for Non-Urgent Work

Anthropic's Message Batches API is the most underused cost lever I know of. The deal is simple: submit requests asynchronously, get results back within 24 hours, pay 50% of standard pricing. If a task doesn't need a real-time response, there's almost no reason not to batch it.

I identified three categories of non-urgent work in my stack that are now 100% batched:

  • Daily reporting — HyrosBot compiles attribution summaries for the previous day. These run at 6am, results needed by 9am. 3-hour window = easy batch.
  • Content research — When I queue blog topics, the research phase (pulling reference material, summarizing sources) runs in batch overnight.
  • SEO analysis — Weekly analysis of post performance, keyword gaps, internal linking opportunities. No urgency.
  • Client brief generation — After a client onboarding call, Orus compiles the brief from call notes. Needed within 24 hours, not within 24 seconds.

Implementation

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Submit a batch of requests
async function submitBatch(tasks: Array<{ id: string; prompt: string }>) {
  const requests = tasks.map(task => ({
    custom_id: task.id,
    params: {
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      messages: [{ role: "user" as const, content: task.prompt }]
    }
  }));

  const batch = await client.messages.batches.create({ requests });
  console.log(`Batch submitted: ${batch.id}`);
  return batch.id;
}

// Poll for results (run this separately, e.g. via cron)
async function collectBatchResults(batchId: string) {
  const batch = await client.messages.batches.retrieve(batchId);

  if (batch.processing_status !== "ended") {
    console.log(`Still processing: ${JSON.stringify(batch.request_counts)}`);
    return null;
  }

  const results: Record<string, string> = {};

  for await (const result of await client.messages.batches.results(batchId)) {
    if (result.result.type === "succeeded") {
      const content = result.result.message.content[0];
      if (content.type === "text") {
        results[result.custom_id] = content.text;
      }
    } else {
      console.error(`Failed: ${result.custom_id}`, result.result);
    }
  }

  return results;
}

// Daily report batch — submit at 6am, collect at 9am
const reportTasks = [
  { id: "hyros-summary", prompt: "Summarize yesterday's attribution data..." },
  { id: "deploy-log", prompt: "Summarize deploy logs from past 24h..." },
  { id: "client-metrics", prompt: "Pull VIXI client performance metrics..." }
];

const batchId = await submitBatch(reportTasks);
// Store batchId in database, collect results via cron 3h later

The 50% discount compounds with other optimizations. If you're using Haiku (already 73% cheaper than Sonnet) and batching (50% off that), you're paying roughly 13% of what you'd pay for synchronous Sonnet calls. That's the floor.
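The compounding is easy to verify. A two-line sketch multiplying the tiering and batching discounts on input pricing:

```typescript
// Compounded input-price discount: Haiku instead of Sonnet, batched.
const SONNET_INPUT = 3.00;   // $/M, synchronous
const HAIKU_INPUT = 0.80;    // $/M, synchronous
const BATCH_DISCOUNT = 0.5;  // Batch API charges 50% of standard pricing

const batchedHaiku = HAIKU_INPUT * BATCH_DISCOUNT;     // $/M
const fractionOfSonnet = batchedHaiku / SONNET_INPUT;  // roughly 0.13

console.log(`Batched Haiku input: $${batchedHaiku.toFixed(2)}/M`);
console.log(`Fraction of sync Sonnet: ${(fractionOfSonnet * 100).toFixed(1)}%`);
```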

Batching savings: ~$89/month — the tasks I batch weren't individually expensive, but there were a lot of them.

Lever 4: Prompt Compression — Fewer Tokens, Same Signal

Prompt compression is the most manual lever but it has compounding effects — smaller prompts reduce costs across every model and every call. After auditing my agents, I found three categories of token waste:

1. Verbose System Prompt Preambles

My original Orus system prompt opened with three paragraphs explaining who Carlos Aragon is, what VIXI does, and the general philosophy of the agent stack. That context matters for onboarding a new conversation — but I was sending it on every turn, including turns 15 through 40 of a long session where the model already had full context.

Fix: I split the system prompt into two parts. The static business context goes in the cached prefix. The dynamic instructions — what Orus should focus on today, what tasks are pending — go in a short, session-specific injection that varies per call. The static portion: 2,400 tokens (cached). The dynamic portion: 300–500 tokens (not cached, but small). Total effective context per call dropped from 3,800 to ~2,800 tokens on non-cached turns.
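A sketch of what the split looks like in the request, with hypothetical names (ORUS_STATIC_CONTEXT and buildDailyFocus are illustrative, not the real stack): the static block carries cache_control, the dynamic block does not.

```typescript
// Minimal local type matching the shape the Messages API expects
// for system content blocks (so this sketch runs standalone).
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Hypothetical names for illustration:
const ORUS_STATIC_CONTEXT = "...static business context, ~2,400 tokens, stable...";
function buildDailyFocus(): string {
  return "...today's focus and pending tasks, ~300-500 tokens, varies...";
}

// Static portion cached; small dynamic portion sent uncached.
function buildSystemBlocks(): SystemBlock[] {
  return [
    {
      type: "text",
      text: ORUS_STATIC_CONTEXT,
      cache_control: { type: "ephemeral" }  // cached: pays 10% on hits
    },
    {
      type: "text",
      text: buildDailyFocus()  // not cached: small and changes per session
    }
  ];
}

// Then pass it through:
// client.messages.create({ model, max_tokens, system: buildSystemBlocks(), messages })
```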

2. Conversation History Bloat

Long sessions accumulate context fast. By turn 30 of a research session, I was sending 25,000+ tokens of conversation history on every call — including early turns that were completely irrelevant to the current task. The model was paying attention to things that didn't matter.

Fix: sliding window + periodic compression. After every 10 turns, I summarize the conversation so far into 400–600 tokens and replace the turn history with that summary:

// conversation-compressor.ts
const MAX_TURNS_BEFORE_COMPRESS = 10;

async function maybeCompressHistory(
  history: Array<{ role: string; content: string }>,
  client: Anthropic
): Promise<Array<{ role: string; content: string }>> {
  if (history.length < MAX_TURNS_BEFORE_COMPRESS * 2) {
    return history;  // No compression needed
  }

  // Keep the last 4 exchanges (8 messages) verbatim — recent context matters
  const recentTurns = history.slice(-8);
  const olderTurns = history.slice(0, -8);

  // Compress the older portion on Haiku (cheap)
  const compressionResponse = await client.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 600,
    messages: [
      {
        role: "user",
        content: `Summarize this conversation history in 400-600 tokens.
Preserve: key decisions made, facts established, current task state.
Omit: pleasantries, redundant explanations, resolved questions.

History:
${olderTurns.map(t => `${t.role}: ${t.content}`).join("\n\n")}`
      }
    ]
  });

  const summary = compressionResponse.content[0].type === "text"
    ? compressionResponse.content[0].text
    : "";

  return [
    { role: "user", content: `[Previous conversation summary: ${summary}]` },
    { role: "assistant", content: "Understood. Continuing from that context." },
    ...recentTurns
  ];
}

3. Redundant Tool Result Echoing

When an agent calls a tool, the full tool result goes into the conversation history. If that result is a 3,000-token API response and the agent calls the tool five times per session, you're carrying 15,000 tokens of raw API data through every subsequent call.

Fix: compress tool results before adding them to history. After a tool call returns, run a Haiku pass to extract only the relevant fields. For HyrosBot pulling attribution data, the raw API response is 2,800 tokens. After compression: 180 tokens with all the numbers Orus actually needs.

// tool-result-compressor.ts
async function compressToolResult(
  toolName: string,
  rawResult: string,
  neededFields: string[],
  client: Anthropic
): Promise<string> {
  const fieldList = neededFields.join(", ");

  const response = await client.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: `Extract only these fields from the ${toolName} result: ${fieldList}
Return as compact JSON. Omit nulls and empty arrays.

Raw result:
${rawResult}`
      }
    ]
  });

  return response.content[0].type === "text"
    ? response.content[0].text
    : rawResult;  // fallback to original if compression fails
}

// Usage after tool call
const rawHyrosData = await callHyrosAPI(params);
const compressed = await compressToolResult(
  "hyros_attribution",
  rawHyrosData,
  ["roas", "cpa", "revenue", "ad_spend", "top_campaigns"],
  client
);

// Add the compressed version to history, not the raw result.
// (The Messages API has no "tool" role; tool results go back in a
// user turn, as a tool_result block referencing the tool_use id.)
conversationHistory.push({
  role: "user",
  content: [{
    type: "tool_result",
    tool_use_id: toolUseId,  // id from the model's tool_use block
    content: compressed  // ~180 tokens vs 2,800
  }]
});

Compression savings: ~$73/month — smaller than caching and tiering, but fully combinable with them.

The Full Picture: Before and After

Let me put all four levers together with real numbers:

Optimization | Monthly Savings | Effort to Implement
Prompt caching | $148 | Low — 1 line per content block
Model tiering | $187 | Medium — audit tasks, build router
Request batching | $89 | Medium — refactor non-urgent calls
Prompt compression | $73 | High — audit, rewrite, add compressors
Total | $497 | $847 → $350 estimated → $339 actual

The $350 estimate vs. $339 actual gap: I found one more quick win during implementation — I was calling Sonnet for a task that needed zero model reasoning (pure string formatting). Moved it to a local function. That's not an API optimization, that's just good engineering. Sometimes the right model is no model.

The order I'd implement if starting over:

  1. Model tiering first — Highest impact per hour of work. Audit your tasks, move simple ones to Haiku.
  2. Prompt caching second — One line of code per content block. Should take an afternoon.
  3. Batching third — Identify non-urgent workflows, refactor to async. Requires process redesign.
  4. Compression last — Most manual work, lowest marginal impact. Do it after the others are in place.

What I Won't Optimize (and Why)

There are cost-reduction moves I explicitly decided against. Knowing what not to optimize is as important as knowing what to fix.

I won't tier down Orus's core reasoning to Haiku. Orus handles complex, multi-step orchestration where wrong decisions cascade. The cost of a bad decision from a less capable model far exceeds the token savings. Sonnet is the floor for orchestration.

I won't compress BellaBot's real-time call context. Call screening happens in 2–3 second windows. Compression latency would break the user experience. Some tasks need speed more than they need cost efficiency.

I won't batch client-facing communications. When a client expects a response, they don't expect "I'll have this in 24 hours." Batch savings don't justify the relationship cost.

I won't aggressively compress prompts for creative work. Blog post drafts, client proposals, and marketing copy benefit from rich context. Compressing the input for creative tasks often compresses the output quality in ways that require expensive revision cycles. The apparent savings evaporate.

The meta-lesson: cost optimization is a constraint satisfaction problem, not a pure minimization problem. You're minimizing cost subject to quality, latency, and reliability constraints. The constraints vary by task. Don't apply the same playbook everywhere.

The Mindset Shift That Made the Difference

The technical optimizations — caching, tiering, batching, compression — are all real and all necessary. But the underlying shift that made them possible was treating API costs as an engineering problem rather than a business expense.

When cost is just a line item on the P&L, you accept it. When it's an engineering metric with observable causes, you fix it. The difference is instrumentation. I spent the first two weeks of this optimization project just adding logging: track which model was used for every call, log input and output token counts, tag every call with the agent name and task type. Once I could see the breakdown, the high-impact targets were obvious.
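A minimal sketch of that instrumentation layer (the names and log schema here are illustrative, not the production setup): wrap every call so model, agent, task type, and token counts land in one place, then aggregate per agent/task pair.

```typescript
// Minimal structural type matching the SDK's usage shape,
// so this sketch runs without the SDK installed.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number | null;
}
interface MessageLike { usage: Usage; }

interface CallLog {
  agent: string;
  taskType: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
}

const callLogs: CallLog[] = [];  // swap for your database or metrics sink

// Wrap any messages.create-style call so it gets tagged and counted.
async function tracked<T extends MessageLike>(
  agent: string,
  taskType: string,
  model: string,
  call: () => Promise<T>
): Promise<T> {
  const response = await call();
  callLogs.push({
    agent,
    taskType,
    model,
    inputTokens: response.usage.input_tokens,
    outputTokens: response.usage.output_tokens,
    cacheReadTokens: response.usage.cache_read_input_tokens ?? 0,
  });
  return response;
}

// The view that makes the expensive call patterns obvious:
// total tokens per agent/task pair.
function usageByTask(): Map<string, number> {
  const totals = new Map<string, number>();
  for (const log of callLogs) {
    const key = `${log.agent}/${log.taskType}`;
    totals.set(key, (totals.get(key) ?? 0) + log.inputTokens + log.outputTokens);
  }
  return totals;
}

// Usage:
// const msg = await tracked("orus", "orchestration", "claude-sonnet-4-6",
//   () => client.messages.create({ model: "claude-sonnet-4-6", ... }));
```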

If you're running agents in production and you haven't audited your token usage by task type, that audit will pay for itself within a week. The waste is almost always concentrated in 20% of your call patterns. Find those 20%, apply the four levers, and the other 80% takes care of itself.

For my stack: $847 → $339. Same agents. Same clients. Same SLA. The difference is now compounding — as I add new agents and new tasks, I start from a 60%-cheaper baseline.

Running agents at scale and watching costs climb?

I'm happy to walk through your stack and identify where the waste is. The four-lever framework applies to most production agent systems — the hard part is the audit, not the fixes.

Get in touch →