2026-04-08 · Claude API · Cost Optimization · Production

AI Agent Cost Optimization: How I Cut My Claude API Bill by 60%

I was paying $800/month to run 11 AI agents. After one weekend of focused optimization, that number dropped to $312 — with identical output quality. Here's exactly what I changed, with real numbers.


The Wake-Up Call: My First $800 API Bill

March 2026. I opened my Anthropic dashboard expecting the usual ~$400 bill and saw $812.47. Same number of agents. Same task volume. Different model pricing tier that had silently rolled over.

The embarrassing part: when I audited my logs, I discovered I was using Claude Opus for spam classification. I was paying premium architect-level rates to answer the question "is this email from a real person?" — a task that a $0.25/1M token model handles just as well.

I decided to treat this like any other engineering problem: measure first, optimize second. I spent a Saturday tagging every API call with task type, model, input tokens, output tokens, and cost. The numbers were clarifying.

The outcome after one weekend:

  • Monthly spend: $812 → $312
  • Cost reduction: 61.6%
  • Output quality: unchanged (measured by task success rate)
  • Total setup time: ~12 hours

Five strategies drove that reduction. I'll walk through each one with real numbers and copy-paste code.

Baseline: What I Was Actually Spending

Before optimizing anything, I needed visibility. I added a thin logging wrapper around every Anthropic SDK call and let it run for 48 hours. Here's what the breakdown looked like:

Agent / Task Type | Model Used | Tasks/Day | Avg Tokens/Task | Monthly Cost
Email triage / classification | Opus | 120 | ~800 | $184
Lead scoring | Opus | 60 | ~1,200 | $138
Content drafts | Sonnet | 30 | ~4,000 | $108
Workflow routing decisions | Opus | 300 | ~400 | $230
Daily summaries / reports | Sonnet | 15 | ~6,000 | $94
Complex strategy tasks | Sonnet | 5 | ~8,000 | $58
Total | | | | $812

The pattern was obvious immediately: workflow routing (300 tasks/day at Opus rates) was my biggest single line item, and it was doing the simplest possible task — pick A, B, or C based on a few keywords. That's a $0.25 Haiku job, not a $15 Opus job.

Strategy 1: Model Tiering — The Biggest Win (~35% savings)

The mental model that changed everything for me: think of model tiers like staffing levels, not quality levels.

  • Haiku — commodity worker. Fast, cheap, accurate on well-defined tasks. Classification, routing, extraction, yes/no questions. ~$0.25/1M input tokens.
  • Sonnet — senior generalist. Good reasoning, solid writing, handles nuance. Most production tasks land here after tiering. ~$3/1M input tokens.
  • Opus — architect/specialist. Only when you need deep reasoning, complex multi-step chains, or tasks where the quality difference actually matters to revenue. ~$15/1M input tokens.

I wrote a routing function that selects the model based on task type. The key insight: 70% of production tasks are classification or extraction — and Haiku handles those at 98%+ accuracy vs. Opus in my testing.

lib/model-router.ts (TypeScript)
type TaskType =
  | 'classification'
  | 'extraction'
  | 'routing'
  | 'summarization'
  | 'drafting'
  | 'analysis'
  | 'strategy';

type ModelTier =
  | 'claude-haiku-4-5-20251001'
  | 'claude-sonnet-4-6'
  | 'claude-opus-4-6';

const MODEL_ROUTING: { [key in TaskType]: ModelTier } = {
  classification: 'claude-haiku-4-5-20251001',  // $0.25/1M — was Opus ($15)
  extraction:     'claude-haiku-4-5-20251001',  // structured data pull
  routing:        'claude-haiku-4-5-20251001',  // pick from defined options
  summarization:  'claude-haiku-4-5-20251001',  // condense existing text
  drafting:       'claude-sonnet-4-6',           // write new content
  analysis:       'claude-sonnet-4-6',           // interpret + reason
  strategy:       'claude-opus-4-6',             // complex multi-step planning
};

export function selectModel(taskType: TaskType): ModelTier {
  return MODEL_ROUTING[taskType];
}

// Usage in your agent:
const response = await anthropic.messages.create({
  model: selectModel('classification'),  // auto-selects Haiku
  max_tokens: 50,
  messages: [{ role: 'user', content: prompt }]
});

Results from tiering alone:

  • Workflow routing: $230 → $14/month (Opus → Haiku)
  • Email classification: $184 → $12/month (Opus → Haiku)
  • Lead scoring: $138 → $48/month (Opus → Sonnet)
  • Total monthly savings: ~$278

Note: lead scoring stayed on Sonnet because it involves interpreting free-form CRM notes — Haiku accuracy dropped to 89% on that task vs. 97% for Sonnet. Always test before downgrading high-stakes decisions.
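
That test doesn't need to be elaborate. Here's a rough sketch of the kind of check I mean — run the same labeled examples through both models and compare hit rates. LABELED_EXAMPLES and the one-word scoring prompt are placeholders, not my production eval:

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Placeholder: historical examples with known-correct labels
interface LabeledExample { input: string; expected: 'hot' | 'warm' | 'cold'; }
declare const LABELED_EXAMPLES: LabeledExample[];

async function accuracyFor(model: string): Promise<number> {
  let correct = 0;
  for (const ex of LABELED_EXAMPLES) {
    const res = await anthropic.messages.create({
      model,
      max_tokens: 10,
      messages: [{
        role: 'user',
        content: `Score this lead as hot, warm, or cold. One word only.\n\n${ex.input}`
      }]
    });
    const text = res.content[0].type === 'text' ? res.content[0].text.toLowerCase() : '';
    if (text.includes(ex.expected)) correct++;
  }
  return correct / LABELED_EXAMPLES.length;
}

// Compare the cheap candidate against the current model on identical inputs
// const haiku  = await accuracyFor('claude-haiku-4-5-20251001');
// const sonnet = await accuracyFor('claude-sonnet-4-6');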

Strategy 2: Prompt Caching — The Silent Multiplier (~25% savings)

Anthropic's prompt caching lets you mark parts of your prompt as cacheable. When the same prefix is reused across requests, you pay 10% of the normal input token price for those cached tokens.

For agents that run frequently with the same system prompt, this is essentially free money. My email classification agent has a 1,100-token system prompt that fires 120 times a day. Before caching, I was paying full price on every single call.

The implementation is one addition to your API call — add cache_control to the content block you want cached:

lib/cached-agent.ts (TypeScript)
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// With prompt caching — pay 10% on system prompt after first call
const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 100,
  system: [
    {
      type: 'text',
      text: LONG_SYSTEM_PROMPT,  // your 800-1200 token system prompt
      cache_control: { type: 'ephemeral' }  // <-- this is the only change
    }
  ],
  messages: [
    {
      role: 'user',
      content: userMessage  // only this changes per request
    }
  ]
});

// First call: cache write — full input price plus a 25% write premium
// Subsequent calls within ~5 min: 90% discount on those tokens
// Cache TTL: ~5 minutes, refreshed on each hit

The cache is valid for roughly 5 minutes and refreshed each time it's hit. For agents that process bursts of requests — like my Monday.com webhook handler that sometimes gets 20 items in 30 seconds — the hit rate approaches 100%.

Key requirement: your system prompt must be at least 1,024 tokens to qualify for caching on Sonnet and Opus (2,048 tokens for Haiku). If your prompts are shorter, combine related system prompts or add your few-shot examples into the same cached block to cross the threshold.
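
For example, one way to cross the threshold is to fold your static few-shot examples into the same cached prefix — they never change between calls, so they cache alongside the instructions. A sketch, where FEW_SHOT_EXAMPLES is a placeholder for your own examples:

const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 100,
  system: [
    { type: 'text', text: SYSTEM_PROMPT },       // core instructions
    {
      type: 'text',
      text: FEW_SHOT_EXAMPLES,                    // static examples push the prefix past the minimum
      cache_control: { type: 'ephemeral' }        // breakpoint: everything up to here is cached
    }
  ],
  messages: [{ role: 'user', content: userMessage }]
});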

My 30-day caching stats:

  • Cache hit rate: 78%
  • Avg system prompt size: ~950 tokens
  • Daily calls across all agents: ~530
  • Monthly savings from caching: ~$94

Strategy 3: Batching Non-Realtime Tasks (~15% savings)

Not every task needs to run the instant it's triggered. I audited my agents and found that roughly 40% of my tasks were non-realtime — daily summaries, content drafts, CRM enrichment, report generation. None of those need to fire within seconds.

Anthropic's Batch API processes requests asynchronously and costs 50% less than the standard API. The tradeoff: results come back within 24 hours instead of seconds. For batch jobs like "summarize all leads from today," that's completely fine.

lib/batch-processor.ts (TypeScript)
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Instead of 50 individual API calls, submit one batch request
async function processBatch(items: { id: string; content: string }[]) {
  const batch = await anthropic.messages.batches.create({
    requests: items.map(item => ({
      custom_id: item.id,
      params: {
        model: 'claude-haiku-4-5-20251001',
        max_tokens: 200,
        messages: [{
          role: 'user',
          content: `Summarize this lead note in 2 sentences: ${item.content}`
        }]
      }
    }))
  });

  console.log(`Batch ${batch.id} submitted — status: ${batch.processing_status}`);
  return batch.id;
}

// Poll for results (or configure a webhook)
async function getBatchResults(batchId: string) {
  const batch = await anthropic.messages.batches.retrieve(batchId);

  if (batch.processing_status === 'ended') {
    const results = [];
    for await (const result of await anthropic.messages.batches.results(batchId)) {
      results.push({ id: result.custom_id, output: result.result });
    }
    return results;
  }
  return null; // not ready yet, check again later
}

In my n8n setup, I have a scheduled workflow that runs at 11 PM every night: it collects all non-urgent AI tasks queued during the day, submits them as a single batch, and stores results in Supabase by 6 AM. My morning briefing agent then picks up the pre-processed summaries rather than calling Claude in real time.
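
Outside of n8n, the same flow is just two small jobs — one to submit, one to collect — reusing the helpers from the batch-processor file above. A rough sketch; the ai_task_queue and ai_task_results tables are illustrative names, not my actual schema:

import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// 11 PM job: gather everything queued during the day, submit as one batch
async function submitNightlyBatch() {
  const { data: queued } = await supabase
    .from('ai_task_queue')
    .select('id, content')
    .eq('status', 'pending');

  if (!queued || queued.length === 0) return;

  const batchId = await processBatch(queued);   // from lib/batch-processor.ts
  await supabase
    .from('ai_task_queue')
    .update({ status: 'submitted', batch_id: batchId })
    .in('id', queued.map(t => t.id));
}

// Early-morning job: persist results once the batch has ended
async function collectNightlyBatch(batchId: string) {
  const results = await getBatchResults(batchId);  // returns null until processing ends
  if (!results) return;

  await supabase.from('ai_task_results').insert(
    results.map(r => ({ task_id: r.id, output: JSON.stringify(r.output) }))
  );
}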

Batching impact:

  • Tasks moved to batch: ~45/day (summaries + CRM enrichment)
  • Cost reduction on batched tasks: 50%
  • Monthly savings: ~$38

Strategy 4: Prompt Compression — Free Money

I write verbose prompts. "Please carefully analyze the following email and determine whether the sender appears to be a genuine prospect who is likely interested in our services, or whether it appears to be spam or irrelevant..."

That's 43 tokens. The compressed version: "Classify: prospect or spam? Email:" — 7 tokens. Same accuracy on Haiku, 84% fewer input tokens.

lib/prompt-utils.ts (TypeScript)
// Before: verbose prompt (~280 tokens)
const verbosePrompt = `
  You are an expert email triage assistant for a digital marketing agency.
  Your job is to carefully read through each incoming email and determine
  with high confidence whether the sender is a legitimate prospective client
  who has genuine interest in our services, or whether the email should be
  classified as spam, cold outreach, or irrelevant. Please respond with
  your classification and a brief explanation of your reasoning.
`;

// After: compressed prompt (~45 tokens — 84% reduction)
const compressedPrompt = `
  Email classifier.
  Categories: [prospect, spam, existing-client, vendor, other].
  Respond ONLY: {"category": "...", "confidence": 0.0-1.0}
`;

// Compression checklist:
// 1. Remove filler ("Please carefully", "Your job is to")
// 2. Use structured output requests instead of prose descriptions
// 3. Move static examples to cached few-shot block (not per-call)
// 4. Specify output format explicitly — prevents verbose responses
// 5. Use JSON/bullets over long sentences

// Bonus: use Claude itself to compress your prompts
const compressionMeta = `
  Compress this prompt to under 80 tokens without losing core constraints.
  Preserve all output format requirements. Output only the compressed version.
  Prompt: [paste your prompt here]
`;

I ran the compression meta-prompt on my 11 most-called system prompts. Average compression ratio: 71%. On 300 routing calls/day × 200 fewer tokens each = 60,000 tokens/day saved — compounding across 30 days.

Strategy 5: Output Token Control — Often Overlooked

Output tokens cost 3–5x more than input tokens depending on the model. If you're not explicitly setting max_tokens per task type, you're paying for verbosity.

config/task-limits.ts (TypeScript)
const MAX_TOKENS: { [key in TaskType]: number } = {
  classification:  50,    // "prospect" + confidence score
  extraction:     200,    // structured JSON output
  routing:         20,    // single letter/word decision
  summarization:  300,    // 2-3 sentence summary
  drafting:      1500,    // actual content creation
  analysis:       800,    // reasoning + conclusion
  strategy:      2000,    // full strategic output
};

// Force JSON output to prevent verbose preambles
const classificationSystem = `
  Respond ONLY with valid JSON: {"category": "prospect|spam|other", "score": 0-100}
  No explanation. No preamble. Only JSON.
`;

// With Haiku + max_tokens:50 + JSON-only instruction:
// Average output: ~28 tokens
// Without constraints: ~180 tokens
// Savings: 84% on the most expensive token type
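
Putting the pieces together, a single call site can derive both the model and the output ceiling from the task type. A sketch, assuming the selectModel helper and TaskType union from Strategy 1, an existing anthropic client, and a placeholder emailBody variable:

const taskType: TaskType = 'classification';

const response = await anthropic.messages.create({
  model: selectModel(taskType),        // Haiku for classification
  max_tokens: MAX_TOKENS[taskType],    // hard ceiling: 50 output tokens
  system: classificationSystem,        // JSON-only instruction from above
  messages: [{ role: 'user', content: emailBody }]
});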

Full Before/After Breakdown

Strategy | Monthly Savings | Setup Time | Difficulty
1. Model tiering (Opus → Haiku/Sonnet) | ~$278 | 3 hrs | Low
2. Prompt caching on system prompts | ~$94 | 2 hrs | Low
3. Batch API for async tasks | ~$38 | 4 hrs | Medium
4. Prompt compression | ~$62 | 2 hrs | Low
5. Output token limits + JSON formatting | ~$28 | 1 hr | Low
Total | $500/month | 12 hrs |

Before (March 2026): $812 · After (April 2026): $312 · Cost reduction: 61.6%

Prioritized Action List

If you're running Claude agents in production today, here's the exact sequence I'd follow:

  1. Audit — add logging to every API call

    Log: model, task type, input tokens, output tokens, latency, cost. Run for 48 hours before changing anything. You can't optimize what you can't see.

  2. Tier your models (highest ROI, ~3 hours)

    Find every Opus call doing classification or routing. Move it to Haiku. Test accuracy before and after. This alone typically cuts 30–40% off your bill.

  3. Add cache_control to every long system prompt

    One line of code per agent. Immediate 10–25% savings on high-frequency calls. Requires prompts over 1,024 tokens (Sonnet/Opus) or 2,048 tokens (Haiku).

  4. Move async-safe tasks to the Batch API

    Anything with acceptable latency over an hour qualifies: daily summaries, enrichment, reports. 50% price reduction, zero quality difference.

  5. Compress your top 5 highest-volume prompts

    Use Claude itself to compress them. Add explicit max_tokens limits. Force JSON output on structured tasks. Expect a 60–80% reduction in prompt verbosity.

The Bottom Line

Cost optimization on Claude API isn't about using worse models or cutting corners — it's about using the right model for each task. Most production workloads are dominated by classification and routing calls that Haiku handles just as well as Opus at 1/60th the price.

The 12 hours I spent on this generates $500/month in perpetual savings — $6,000/year from one weekend of focused work. The techniques compound as you add more agents.

If you're running AI agents in production and haven't audited your costs, your first step is simply adding a logging wrapper. Everything else flows from having visibility into what you're actually spending.
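
Here's a minimal sketch of the kind of wrapper I mean — one function that makes the call, then logs tokens and an estimated cost. The PRICES map is illustrative; plug in the current rates for the models you actually use, and swap console.log for your own storage:

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Illustrative per-million-token prices — replace with current rates for your models
const PRICES: Record<string, { input: number; output: number }> = {
  'claude-haiku-4-5-20251001': { input: 0.25, output: 1.25 },
  'claude-sonnet-4-6':         { input: 3,    output: 15 },
  'claude-opus-4-6':           { input: 15,   output: 75 },
};

export async function loggedCreate(
  taskType: string,
  params: Anthropic.MessageCreateParamsNonStreaming
) {
  const start = Date.now();
  const response = await anthropic.messages.create(params);

  const price = PRICES[params.model] ?? { input: 0, output: 0 };
  const cost =
    (response.usage.input_tokens / 1_000_000) * price.input +
    (response.usage.output_tokens / 1_000_000) * price.output;

  // Swap console.log for a DB insert once you want to aggregate over days
  console.log(JSON.stringify({
    taskType,
    model: params.model,
    inputTokens: response.usage.input_tokens,
    outputTokens: response.usage.output_tokens,
    latencyMs: Date.now() - start,
    estimatedCost: Number(cost.toFixed(4)),
  }));

  return response;
}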

Carlos Aragón

AI Agent Builder · Allen, TX · VIXI LLC

Running 11 production agents. Writing about what actually works.
