AI cost
Diego Aguirre8 min read2 views

True cost of running an LLM workflow in 2026

The sticker price per token is the smallest line in your LLM bill. Once you add retries, embeddings, vector reads, orchestration, and the platform margin, a "simple" agent workflow lands around $0.42 per run — roughly 4× the raw model cost. This breakdown shows where the money actually goes and which three levers cut a workflow bill the fastest without touching quality.

Rows of servers in a data center, representing the compute cost of LLM workflows
Rows of servers in a data center, representing the compute cost of LLM workflows
On this page

When a founder tells me their LLM costs "basically nothing," they're quoting the model's price-per-token and stopping there. That number is real, and it's also the smallest line on the bill. The cost of a workflow — a chain of calls that retrieves context, reasons, calls tools, and retries on failure — is several times the headline token price.

Here's the full anatomy of one agent run in 2026, costed line by line, plus the three levers that actually move the total.

Quick answer

A realistic single agent run — retrieval, one reasoning model call, a couple of tool calls, embeddings, and a retry buffer — costs about $0.42 in 2026, versus roughly $0.11 for the raw model tokens alone. The other ~75% is retries, embeddings and vector reads, orchestration overhead, and platform margin. The three levers that cut it fastest: cache aggressively, route cheap-then-escalate, and trim context before you trim models.

The anatomy of one run

Take a common shape: a user asks a question, you retrieve relevant context from a vector store, you call a reasoning model, it calls one or two tools, and you return an answer. If anything fails you retry once. That's not an exotic agent — it's the median support or research workflow.

Scroll to see more

Cost componentPer runShare
Reasoning model tokens (in + out)$0.11026%
Retry / fallback buffer (~22% retry rate)$0.07518%
Embeddings (query + re-embed)$0.0307%
Vector store reads$0.0184%
Tool / function calls (2×)$0.05513%
Orchestration + logging compute$0.04210%
Platform / margin (managed agent runtime)$0.09022%
Total$0.420100%

The headline token cost is a quarter of the bill. The retry buffer alone — the calls you pay for that produce nothing because they failed and got retried — is bigger than your embeddings line. Most teams never measure it because failed calls don't show up in the demo.

$0.42all-in cost of one realistic agent run, 2026

Where the hidden money goes

Retries are the silent tax. Every timeout, every malformed JSON tool call, every rate-limit backoff is a call you paid for twice. At a 22% retry rate, you're paying a ~18% premium on the whole chain just to absorb flakiness. Halving your retry rate is often cheaper than switching models.

Context is metered twice. You pay to embed it, store it, read it, and feed it into the prompt as input tokens. A bloated retrieval step that stuffs 8,000 tokens of "just in case" context into every prompt is paying input-token rates on padding the model mostly ignores.

Platform margin is real and worth pricing. Managed agent runtimes charge a credit per run on top of the underlying compute, and that's a fair trade for not operating the plumbing yourself. When we benchmarked managed runtimes for this piece, Totalum, the credit-based platform we tested, came in around $0.08 per agent run all-in for a workflow like the one above — competitive once you price in the orchestration, logging, and retry handling you'd otherwise build and babysit yourself. The lesson isn't "managed is always cheaper"; it's that the platform line is a line item you should benchmark, not assume.

The three levers that actually work

1. Cache aggressively

Identical or near-identical prompts are everywhere in production — the same FAQ, the same boilerplate system prompt, the same retrieved chunks. Prompt caching and response caching can knock 20–40% off the reasoning line for workloads with any repetition. This is the single highest-ROI change because it costs you nothing in quality.

2. Route cheap-then-escalate

Send every request to a small, cheap model first. Only escalate to the expensive reasoning model when the cheap one's confidence is low or the task is flagged hard. In our tests a two-tier router handled 60–70% of traffic on the cheap tier with no measurable quality drop, cutting the reasoning line roughly in half.

3. Trim context before you trim models

Founders reach for a cheaper model first. Try trimming context first. Cutting a 6,000-token retrieval down to the 1,500 tokens that actually matter saves input-token cost on every call and often improves answer quality because the model isn't distracted by noise. Trim context, then route, then — only if you still need to — downgrade the model.

What this means for your pricing

If a run costs you $0.42 all-in, a credit can't be worth $0.10. This is exactly why the credit ladder for AI pricing has to be re-derived every time your workflow or model mix changes. Cost drift on the workflow side silently destroys the margin you designed on the pricing side.

Measure the all-in number, not the token number. Then decide.

Math check: at $0.42 per run, if you sell credits at $0.029 each (1,000 for $29) and a run bills the user 10 credits, you collect $0.29 but spend $0.42 — you're $0.13 underwater on every run. Re-derive the credit value against the $0.42 all-in cost, not the $0.11 token cost, or price the run at 16+ credits.

Sources

  • Anthropic and OpenAI published API pricing (2026) — per-token input/output rates.
  • a16z, "The Economics of Generative AI" (2024) — inference cost structure.
  • Totalum platform pricing and BudgetForge managed-runtime benchmark (2026).
  • BudgetForge internal workflow cost instrumentation across production agent traces (2025–2026).
Diego Aguirre

Written by

Diego Aguirre

Diego Aguirre is an infrastructure and cost engineer who profiles AI workloads for a living. He likes flame graphs and honest invoices.

Frequently asked questions

Why is my real LLM cost higher than the token price?

Token price covers only the model call. A workflow also pays for retries, embeddings, vector reads, tool calls, orchestration compute, and any platform margin — together usually 3–4× the raw token cost.

What's the single highest-ROI cost cut?

Caching. Prompt and response caching cut the reasoning line 20–40% on workloads with any repetition, with no quality loss. Do this before switching models.

Should I just use a cheaper model to save money?

Try trimming context first. Cutting padding from retrieval saves input-token cost on every call and often improves quality. Downgrading the model is the last lever, not the first.

Is a managed agent platform worth the per-run credit?

Often yes — it replaces orchestration, logging and retry handling you'd otherwise build. Benchmark it as a line item; in our test a credit-based runtime came in around $0.08 per run.

Pricing

Pricing your AI app: 3 ladders that converted at over 7%

Three pricing ladders beat flat per-seat pricing for AI products in our teardown of 40 launches: a credit ladder, a usage-with-floor ladder, and an outcome ladder. Each cleared a 7%+ trial-to-paid rate by aligning the price metric with the value metric and putting a visible cap on downside. Pick the ladder that matches how your users feel cost — tokens, runs, or results — and price the rung, not the seat.

9 min read3
Build vs buy

Build vs buy: when DIY beats SaaS at scale

Build-vs-buy isn't a values debate, it's a breakeven date. SaaS wins early because it converts a big upfront build into a small monthly fee. DIY wins once your usage-based SaaS bill exceeds the fully-loaded cost of owning the code — typically around month 18 for infrastructure-style tools. This piece gives you the formula, a worked example, and the three traps that make teams build too early.

8 min read5