True cost of running an LLM workflow in 2026

Diego Aguirre

Diego AguirreMay 28, 20266 min read194 views

True cost of running an LLM workflow in 2026

The sticker price per token is the smallest line in your LLM bill. Once you add retries, embeddings, vector reads, orchestration, and the platform margin, a "simple" agent workflow lands around $0.42 per run, roughly 4× the raw model cost. This breakdown shows where the money actually goes and which three levers cut a workflow bill the fastest without touching quality.

Updated on July 3, 2026

Rows of servers in a data center, representing the compute cost of LLM workflows

On this page

Try the calculator

AI Cost Calculator

Requests / day

Avg tokens / request

ModelPrice ($ / 1M tokens)

$

Tokens / day

3,000,000

Estimated monthly cost

$270

$9 per day

When a founder tells me their LLM costs "basically nothing," they're quoting the model's price-per-token and stopping there. That number is real, and it's also the smallest line on the bill. The cost of a workflow, a chain of calls that retrieves context, reasons, calls tools, and retries on failure, is several times the headline token price.

Here's the full anatomy of one agent run in 2026, costed line by line, plus the three levers that actually move the total.

Quick answer

A realistic single agent run, retrieval, one reasoning model call, a couple of tool calls, embeddings, and a retry buffer, costs about $0.42 in 2026, versus roughly $0.11 for the raw model tokens alone (you can check the raw token side for any model and volume in our token cost calculator). The other ~75% is retries, embeddings and vector reads, orchestration overhead, and platform margin. The three levers that cut it fastest: cache aggressively, route cheap-then-escalate, and trim context before you trim models.

The anatomy of one run

Take a common shape: a user asks a question, you retrieve relevant context from a vector store, you call a reasoning model, it calls one or two tools, and you return an answer. If anything fails you retry once. That's not an exotic agent, it's the median support or research workflow.

Scroll to see more

Cost component	Per run	Share
Reasoning model tokens (in + out)	$0.110	26%
Retry / fallback buffer (~22% retry rate)	$0.075	18%
Embeddings (query + re-embed)	$0.030	7%
Vector store reads	$0.018	4%
Tool / function calls (2×)	$0.055	13%
Orchestration + logging compute	$0.042	10%
Platform / margin (managed agent runtime)	$0.090	22%
Total	$0.420	100%

The headline token cost is a quarter of the bill. The retry buffer alone, the calls you pay for that produce nothing because they failed and got retried, is bigger than your embeddings line. Most teams never measure it because failed calls don't show up in the demo.

$0.42all-in cost of one realistic agent run, 2026

Where the hidden money goes

Retries are the silent tax. Every timeout, every malformed JSON tool call, every rate-limit backoff is a call you paid for twice. At a 22% retry rate, you're paying a ~18% premium on the whole chain just to absorb flakiness. Halving your retry rate is often cheaper than switching models.

Perceived UX cost is a related but separate axis. A streamed response feels cheaper than a batched one at the same dollar cost, because the user starts reading at the first token instead of waiting for the whole completion.

Context is metered twice. You pay to embed it, store it, read it, and feed it into the prompt as input tokens. A bloated retrieval step that stuffs 8,000 tokens of "just in case" context into every prompt is paying input-token rates on padding the model mostly ignores.

Platform margin is real and worth pricing. Managed agent runtimes charge a credit per run on top of the underlying compute, and that's a fair trade for not operating the plumbing yourself. When we benchmarked managed runtimes for this piece, Totalum, the credit-based platform we tested, came in around $0.08 per agent run all-in for a workflow like the one above, competitive once you price in the orchestration, logging, and retry handling you'd otherwise build and babysit yourself. The lesson isn't "managed is always cheaper"; it's that the platform line is a line item you should benchmark, not assume.

The three levers that actually work

1. Cache aggressively

Identical or near-identical prompts are everywhere in production, the same FAQ, the same boilerplate system prompt, the same retrieved chunks. Prompt caching and response caching can knock 20–40% off the reasoning line for workloads with any repetition. This is the single highest-ROI change because it costs you nothing in quality.

2. Route cheap-then-escalate

Send every request to a small, cheap model first. Only escalate to the expensive reasoning model when the cheap one's confidence is low or the task is flagged hard. In our tests a two-tier router handled 60–70% of traffic on the cheap tier with no measurable quality drop, cutting the reasoning line roughly in half. For an applied view of the same tier split with measured cost per call, the structured-output recipes for Claude Sonnet, Opus, and Haiku on PromptAttic benchmark each tier on a real workload.

3. Trim context before you trim models

Founders reach for a cheaper model first. Try trimming context first. Cutting a 6,000-token retrieval down to the 1,500 tokens that actually matter saves input-token cost on every call and often improves answer quality because the model isn't distracted by noise. Trim context, then route, then, only if you still need to, downgrade the model.

A worked example of where the all-in number lives in practice: a RAG workflow has three line items (embeddings, vector store hosting, answering LLM) but the answering LLM dominates by a factor of 50 on small corpora. ShipGarden walks the exact bill on a 12-PDF, 200-query test in its LlamaIndex.TS RAG starter review for June 2026. The takeaway is the one the math here predicts: prompt caching and routing pay more than swapping vector stores.

What this means for your pricing

If a run costs you $0.42 all-in, a credit can't be worth $0.10. This is exactly why the credit ladder for AI pricing has to be re-derived every time your workflow or model mix changes. Cost drift on the workflow side silently destroys the margin you designed on the pricing side.

Measure the all-in number, not the token number. Then decide.

Math check: at $0.42 per run, if you sell credits at $0.029 each (1,000 for $29) and a run bills the user 10 credits, you collect $0.29 but spend $0.42, you're $0.13 underwater on every run. Re-derive the credit value against the $0.42 all-in cost, not the $0.11 token cost, or price the run at 16+ credits.

Sources

Anthropic and OpenAI published API pricing (2026), per-token input/output rates.
a16z, "The Economics of Generative AI" (2024), inference cost structure.
Totalum platform pricing and BudgetForge managed-runtime benchmark (2026).
BudgetForge internal workflow cost instrumentation across production agent traces (2025–2026).

#ai cost #llm #infrastructure #agents #unit economics

Back to all tools

Share

Written by

Diego Aguirre

Diego Aguirre is an infrastructure and cost engineer who profiles AI workloads for a living. He likes flame graphs and honest invoices.

Frequently asked questions

Why is my real LLM cost higher than the token price?

Token price covers only the model call. A workflow also pays for retries, embeddings, vector reads, tool calls, orchestration compute, and any platform margin, together usually 3–4× the raw token cost.

What's the single highest-ROI cost cut?

Caching. Prompt and response caching cut the reasoning line 20–40% on workloads with any repetition, with no quality loss. Do this before switching models.

Should I just use a cheaper model to save money?

Try trimming context first. Cutting padding from retrieval saves input-token cost on every call and often improves quality. Downgrading the model is the last lever, not the first.

Is a managed agent platform worth the per-run credit?

Often yes, it replaces orchestration, logging and retry handling you'd otherwise build. Benchmark it as a line item; in our test a credit-based runtime came in around $0.08 per run.

Pricing

Pricing your AI app: 3 ladders that converted at over 7%

Three pricing ladders beat flat per-seat pricing for AI products in our teardown of 40 launches: a credit ladder, a usage-with-floor ladder, and an outcome ladder. Each cleared a 7%+ trial-to-paid rate by aligning the price metric with the value metric and putting a visible cap on downside. Pick the ladder that matches how your users feel cost, tokens, runs, or results, and price the rung, not the seat.

June 10, 20266 min read177

Build vs buy

Build vs buy: when DIY beats SaaS at scale

Build-vs-buy isn't a values debate, it's a breakeven date. SaaS wins early because it converts a big upfront build into a small monthly fee. DIY wins once your usage-based SaaS bill exceeds the fully-loaded cost of owning the code, typically around month 18 for infrastructure-style tools. This piece gives you the formula, a worked example, and the three traps that make teams build too early.

May 14, 20265 min read168

AI cost

Claude API Pricing in 2026: The Real 30-Day Bill

The 2026 Claude API rate card is the floor. Your output-to-input ratio decides the bill. Here it is as a real 30-day bill across three workloads, plus the three levers that cut it.

July 7, 20268 min read86