AI Token Cost Calculation: A Pricing-Independent Framework for Forecasting LLM Spend (2026)

Katerina Merzlova

Digital Transformation Consultant

Kirill Funtikov

R&D Lead

33 mins | June 10, 2026

TL;DR

Prices change almost weekly. Your cost model shouldn’t. This framework forecasts LLM spend with durable ratios, not a price table that goes stale in a month.
The formula has six variables: active users, sessions, input and output tokens per session, retry rate, cache hit rate, and agent multiplier.
Three ratios stay stable as prices fall: output costs about 5× input, flagship models cost 15–30× economy models, and agentic workflows use 3–10× the tokens of simple RAG.
The highest-impact decision is tier routing — send most traffic to economy-tier models, reserve flagship-tier for the few percent that need it.
Plug today’s prices into the calculator for a live number. The logic below tells you what to enter, and why.

EXCEL TEMPLATE: AI token cost calculator — estimate monthly LLM spend by users, tokens, and model tier

This AI token cost calculator gives you a live number with current pricing. The article below explains the formula behind it, the four pricing tiers that outlive model names, and the five techniques that cut your bill 40–70%.

When a $12,000 forecast became $34,000

A SaaS company we worked with launched an internal copilot for 800 employees in early 2025. They budgeted $12,000 a month. By month three the bill was $34,000.

Nothing dramatic had happened. The vendor hadn’t raised prices, and headcount on the tool grew 14%, not 180%. The team had simply built their estimate on one assumption: one user message, one model reply. But production doesn’t work that way. Real conversations run several turns, users hit “regenerate,” and the retrieval step kept stuffing more context into every prompt than anyone had budgeted for.

We helped them rebuild the forecast. Then, over the following year, something more useful happened. Three new model generations shipped, and prices dropped more than once. So each time, the team rebuilt their estimate from scratch, with a new model, a new spreadsheet, and a new number. Eventually we noticed the obvious thing: the prices were the only part that ever changed. The shape of the bill never moved at all, whether that meant how input relates to output, how a cheap model relates to an expensive one, or how an agent multiplies calls.

That is what this article is about. We at SumatoSoft have built AI systems for clients across 25+ countries, and the framework here, plus the AI token cost calculator at the top of the page, is built to survive the next model release, and the one after. It’s also Pillar 2, Financial Governance, of our Agentic Development Lifecycle.

Why this article doesn’t lead with a price table

Most LLM cost guides open with a table of model prices. We don’t, for one plain reason: any such table is wrong within weeks.

For equivalent capability, frontier prices fell roughly 10× between 2023 and 2025. For example, a GPT-4-class model that cost about $60 per million output tokens at launch has equivalents today in the low single digits. (That’s the price of equivalent capability falling, not the flagship price-point itself dropping tenfold, since the newest flagship still sits at a premium.) And new generations land every few months, each one resetting the lineup. So if you write “use Model X at $Y,” you’ve published a fact with a one-quarter shelf life.

What doesn’t expire is the structure underneath:

A formula that turns usage into cost.
The ratios between input and output, and between cheap and expensive tiers.
The drivers that inflate consumption.
And the levers that bring it back down.

Learn those once, and you can price any lineup: today’s, next quarter’s, or next year’s. The current price then becomes a single number you look up at the moment of calculation, not the foundation of what you understand. That’s also why the calculator is the right home for live pricing, because it holds today’s numbers and gets refreshed as the market moves. This article, by contrast, holds the part that doesn’t.

The anatomy of an LLM bill

Before you forecast, though, know what you’re paying for. This section has almost no perishable numbers in it, just one ratio that matters.

Providers bill per token. One token is roughly four characters, or about 0.75 words in English, so a 500-word email runs about 670 tokens (500 ÷ 0.75). Every major provider meters this way, and that isn’t likely to change.
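As a rough illustration of that rule of thumb, the Python sketch below estimates token counts from word or character counts. The helper names are hypothetical, and the 0.75 words-per-token and 4 characters-per-token figures are the English-language approximations quoted above, not any provider's tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text
CHARS_PER_TOKEN = 4     # rule of thumb for English text


def tokens_from_words(word_count: int) -> int:
    """Estimate tokens from an English word count."""
    return round(word_count / WORDS_PER_TOKEN)


def tokens_from_chars(char_count: int) -> int:
    """Estimate tokens from a character count."""
    return round(char_count / CHARS_PER_TOKEN)


email_tokens = tokens_from_words(500)
print(f"500-word email: ~{email_tokens} tokens")
```

Treat the result as a planning estimate. Code, non-English text, and structured data tokenize differently, so measure real traffic once you have it.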

Input and output are priced separately, and the distinction is the whole game. Input is everything you send: the system prompt, retrieved context in a RAG setup (retrieval-augmented generation, where the model is fed relevant documents at query time), the conversation history, and the user’s message. Output is what the model writes back. The two behave nothing alike on cost.

Here is the one durable number to carry with you: output tokens cost roughly 5× input, and on current flagship models the ratio is closer to 6×. This holds across providers and across model generations, and it holds for a structural reason. Input tokens are processed in parallel, so the model reads them more or less at once. Output tokens, however, are generated one at a time, each one depending on the last. Faster hardware drives the absolute price down every year. Still, the asymmetry between “read in parallel” and “write sequentially” stays, so the ratio stays with it.

The practical consequence: a forecast that counts only input — the single most common mistake we see — misses the expensive half of the bill entirely.
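To make that concrete, here is a minimal Python sketch that splits one session's cost into its input and output halves. The token counts and the $3 / $15 per-million prices are the representative figures from this article's worked example, not live pricing, and `session_cost` is a hypothetical helper name.

```python
def session_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost) in dollars for one session.

    Prices are dollars per million tokens.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost, output_cost


# 2,500 input and 400 output tokens at $3 / $15 per million.
in_cost, out_cost = session_cost(2_500, 400, 3.0, 15.0)
token_share = 400 / (2_500 + 400)
cost_share = out_cost / (in_cost + out_cost)
print(f"output is {token_share:.0%} of tokens but {cost_share:.0%} of cost")
```

With these inputs, output is a small minority of the tokens but close to half of the spend, which is the part an input-only forecast never sees.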

The Forecasting Formula

There are two versions. One for a quick check, one for a number you’d actually defend to a CFO.

The core equation

Monthly Cost = Users × Sessions_per_User × Avg_Tokens_per_Session × Price_per_Token

Fine for thirty seconds of arithmetic. Not fine for a budget.

The production formula

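Monthly Cost = ( (Users × Active_days × Sessions/day × Avg_Input_Tokens × Input_Price) + (Users × Active_days × Sessions/day × Avg_Output_Tokens × Output_Price) + (Retry_rate × Base_cost) − (Cache_hit_rate × Cacheable_cost) ) × Agent_Multiplier

Here Users is monthly active users, Active_days is the number of active days per user per month, and Base_cost is the sum of the input and output terms.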

Notice where price shows up: two inputs, Input_Price and Output_Price. Everything else is about your usage and your architecture — which is where forecasting accuracy actually comes from. When prices change, you swap two numbers and the structure holds.
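The production formula can be sketched as a single Python function. This is one possible reading, not a reference implementation. It treats the day-count term as active days per month, as the worked example does, and it models the cacheable cost as a share of the base cost through a hypothetical `cacheable_share` parameter. All parameter names are illustrative.

```python
def monthly_cost(
    users: int,
    active_days: float,
    sessions_per_day: float,
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_m: float,
    output_price_per_m: float,
    retry_rate: float = 0.0,
    cache_hit_rate: float = 0.0,
    agent_multiplier: float = 1.0,
    cacheable_share: float = 1.0,
) -> float:
    """Monthly spend in dollars; prices are dollars per million tokens."""
    sessions = users * active_days * sessions_per_day
    input_cost = sessions * avg_input_tokens / 1_000_000 * input_price_per_m
    output_cost = sessions * avg_output_tokens / 1_000_000 * output_price_per_m
    base_cost = input_cost + output_cost

    retry_cost = retry_rate * base_cost
    # Assumption: cacheable cost is a share of the base (1.0 = the whole base).
    cache_saving = cache_hit_rate * cacheable_share * base_cost

    return (base_cost + retry_cost - cache_saving) * agent_multiplier


# Round illustrative inputs, not real pricing.
cost = monthly_cost(
    users=1_000, active_days=20, sessions_per_day=1,
    avg_input_tokens=1_000, avg_output_tokens=1_000,
    input_price_per_m=1.0, output_price_per_m=5.0,
    retry_rate=0.10, cache_hit_rate=0.20,
)
print(f"${cost:,.2f}")
```

When prices change, only `input_price_per_m` and `output_price_per_m` need new values; every other argument describes usage or architecture.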

The six variables

Monthly Active Users (MAU). Active, not registered. Use a 30-day rolling count. No telemetry yet? Estimate 20–40% of registered users as active.
Sessions per active day. Pull this from a comparable product. An internal copilot runs 3–8 sessions per active day; a customer-support agent, 10–25 per agent; a field-service copilot, 2–4 per technician.
Average tokens per session — input and output, separately. Input is system prompt plus retrieved context plus history plus the user’s message. Output is the response. Model them apart, because of that ~5× gap.
Retry rate. 5–15% in mature production. Covers user regenerations plus system retries for safety filters, validation failures, and rate limits.
Cache hit rate. The share of requests served from cache. Two different mechanisms live here — see the note below, because they don’t save money the same way.
Agent multiplier. Simple RAG is 1×. Self-correcting RAG, about 3×. Light agents with two or three tools, 3–5×. Full agentic systems with planning and reasoning loops, 5–10×. Our Agentic RAG implementation guide goes deep on this.

A note on the two cache types, because mixing them up is how forecasts drift. Prompt-prefix caching is native to the major APIs. It discounts repeated input only — up to about 90% off the cached portion — and never touches output. Semantic caching is a layer you add. It returns a full stored response — input and output together — for queries close enough to ones already answered. So in the formula, prompt-prefix caching reduces only the input part of the cost, while semantic caching reduces the whole eligible response. The two are not interchangeable.
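The difference between the two cache types is easy to see in a small Python sketch. The function names are hypothetical. The $1,620 input and $1,296 output costs come from the worked example in this article. The 12% cached share assumes only the 300-token system prompt out of 2,500 input tokens is cacheable and that every request hits the prefix cache, which is a simplification.

```python
def with_prefix_cache(input_cost, output_cost, cached_input_share, discount=0.90):
    """Prompt-prefix caching: discounts only the cached share of input."""
    discounted_input = input_cost * (1 - cached_input_share * discount)
    return discounted_input + output_cost


def with_semantic_cache(input_cost, output_cost, hit_rate):
    """Semantic caching: a hit skips the paid call, input and output alike."""
    return (input_cost + output_cost) * (1 - hit_rate)


input_cost, output_cost = 1_620.0, 1_296.0
prefix_total = with_prefix_cache(input_cost, output_cost, cached_input_share=0.12)
semantic_total = with_semantic_cache(input_cost, output_cost, hit_rate=0.35)

print(f"no cache:       ${input_cost + output_cost:,.2f}")
print(f"prefix cache:   ${prefix_total:,.2f}")
print(f"semantic cache: ${semantic_total:,.2f}")
```

The prefix cache leaves the output term untouched, so its saving is bounded by the cacheable part of the input. The semantic cache scales the whole base by the hit rate.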

Worked example — an internal HR/IT copilot

We’ll use representative workhorse-tier pricing here: $3 input, $15 output per million tokens, a typical mid-tier rate in mid-2026. When you run this for your own project, drop in the current price for whichever tier you’ve chosen — the method is identical and the prices are interchangeable.

The setup:

A SaaS firm, 5,000 employees. Internal HR/IT copilot, used by 60% of staff.
MAU 3,000. Active days 18 a month. Four sessions per active day.
Input 2,500 tokens per session (system prompt 300 + RAG context 1,800 + user message 400). Output 400.
Retry rate 8%. Semantic cache hit rate 35% — a full-response cache, so it applies to the whole base. Agent multiplier 1.5× (light agentic RAG).

Step 1, sessions per month: 3,000 × 18 × 4 = 216,000
Step 2, input cost: 216,000 × 2,500 = 540M input tokens → 540 × $3 = $1,620
Step 3, output cost: 216,000 × 400 = 86.4M output tokens → 86.4 × $15 = $1,296
Step 4, base cost: $1,620 + $1,296 = $2,916
Step 5, retries and cache: retries add 8% (+$233); the semantic cache removes 35% of the base (−$1,021). Adjusted base: $2,128
Step 6, agent multiplier: $2,128 × 1.5 = $3,192 a month

That’s a number you can put in front of a CFO. Run it again with any other tier’s prices and only the two price inputs move; the steps are the same.
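The six steps above can be replayed in a few lines of Python, which makes it easy to swap in another tier's prices. The variable names are illustrative; the inputs are the ones stated in the setup.

```python
MILLION = 1_000_000

# Setup from the worked example.
mau, active_days, sessions_per_day = 3_000, 18, 4
input_tokens, output_tokens = 2_500, 400
input_price, output_price = 3.0, 15.0  # dollars per million tokens
retry_rate, cache_hit_rate, agent_multiplier = 0.08, 0.35, 1.5

sessions = mau * active_days * sessions_per_day                  # step 1
input_cost = sessions * input_tokens / MILLION * input_price     # step 2
output_cost = sessions * output_tokens / MILLION * output_price  # step 3
base_cost = input_cost + output_cost                             # step 4
retry_cost = retry_rate * base_cost                              # step 5
cache_saving = cache_hit_rate * base_cost                        # step 5
adjusted_base = base_cost + retry_cost - cache_saving
monthly_total = adjusted_base * agent_multiplier                 # step 6

for label, value in [
    ("input cost", input_cost),
    ("output cost", output_cost),
    ("base cost", base_cost),
    ("adjusted base", adjusted_base),
    ("monthly total", monthly_total),
]:
    print(f"{label:>14}: ${value:,.2f}")
```

Carried without intermediate rounding, the total comes to about $3,193, a dollar above the figure in the steps, which round at each stage.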

One honest caveat. Applying the agent multiplier at the very end is a simplification. It’s a fair one here, because this is a light-agentic example at 1.5×. But if you’re modeling a heavy agentic system at 5–10×, don’t multiply at the end. The loops happen per call, so caching and retries interact with each loop, and you should model it call by call. Name the simplification, and know where it stops being safe.
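A call-by-call model might look like the Python sketch below. It is an illustration under stated assumptions, not a prescribed method. The three step names and their token counts are hypothetical, all calls use the representative $3 / $15 pricing, retries are attached to individual calls, and a semantic-cache hit is assumed to skip the entire loop.

```python
from dataclasses import dataclass


@dataclass
class Call:
    name: str
    input_tokens: int
    output_tokens: int
    input_price_per_m: float
    output_price_per_m: float
    retry_rate: float = 0.0  # expected extra attempts for this call


def call_cost(call: Call) -> float:
    """Expected dollars for one call, including its own retries."""
    base = (call.input_tokens * call.input_price_per_m
            + call.output_tokens * call.output_price_per_m) / 1_000_000
    return base * (1 + call.retry_rate)


def expected_request_cost(calls: list[Call], semantic_hit_rate: float) -> float:
    """A cache hit skips every call; a miss pays for the whole loop."""
    return (1 - semantic_hit_rate) * sum(call_cost(c) for c in calls)


# Hypothetical three-step agent loop.
loop = [
    Call("plan", 1_000, 200, 3.0, 15.0),
    Call("pick_tool", 1_500, 100, 3.0, 15.0),
    Call("synthesize", 3_000, 400, 3.0, 15.0, retry_rate=0.08),
]
per_request = expected_request_cost(loop, semantic_hit_rate=0.35)
print(f"expected cost per request: ${per_request:.5f}")
```

Because each call carries its own tokens, prices, and retry rate, moving a sub-step to a cheaper tier or capping the loop length changes the forecast directly instead of being hidden inside one multiplier.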

The Four Pricing Tiers (a Framework That Outlives Model Names)

Rather than a table of names that’ll be obsolete by autumn, here’s the durable structure. Every provider’s lineup, in every generation, sorts into four tiers. The names rotate. The tiers don’t.

What lasts is the spread, not any single price: cheapest tier to most expensive runs roughly 15–30×. It has held or widened over the years, because even as every absolute price dropped, new nano-tier models kept pulling the floor down. That spread is what makes tier routing the single highest-impact cost decision available to you.

For today’s specific models and their prices, use the calculator above or go straight to the source: OpenAI, Anthropic, Google Cloud Vertex AI, AWS Bedrock. To slot a model into the table, compare its output price to the cheapest model in the same provider’s lineup.

The Six Factors That Drive Token Spend

Same shape for each: what it is, why it inflates the bill, how to model it, and how to control it. It’s all ratios and behavior, with no absolute prices, so none of it goes stale.

1. User volume and frequency growth

Consumption scales linearly with active users and session frequency, and new AI products routinely see MAU grow 2–4× in the first six months. The trap, then, is forecasting on the launch-day user count, and watching the bill double by month four. So model it with a 12-month growth curve instead of a static number. Then control it by tiering access, because not everyone needs unlimited use on day one.
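One simple way to build that growth curve is sketched below in Python. The shape is an assumption: compound monthly growth that reaches the chosen multiple after six months, then stays flat through month twelve. `mau_curve` is a hypothetical helper, and the 1,000-user launch figure is illustrative.

```python
def mau_curve(launch_mau: float, six_month_multiple: float, months: int = 12) -> list[float]:
    """Projected MAU per month; index 0 is the launch month.

    Assumes compound growth for six months, then a plateau.
    """
    monthly_factor = six_month_multiple ** (1 / 6)
    return [launch_mau * monthly_factor ** min(m, 6) for m in range(months)]


curve = mau_curve(launch_mau=1_000, six_month_multiple=4.0)
for month, mau in enumerate(curve, start=1):
    print(f"month {month:>2}: {mau:,.0f} MAU")
```

At the 4× end of the range, this curve doubles the launch-day user count by month four, which is the bill-doubling trap described above. Feed each month's MAU into the cost formula instead of a single static count.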

2. Context window size

Every token in context, including the system prompt, RAG chunks, history, and few-shot examples, is billed as input on every request. Teams load context “just in case,” and prompts accrete. For example, a system prompt that started lean picks up 30–50% bloat over six months as edge cases get patched in. So track average input tokens per request as a KPI, and compare it to the minimum that holds quality. Then control it with prompt compression, selective retrieval, and RAG over long context, retrieving the 2K tokens that matter instead of stuffing a 100K window. (Our RAG development work lives here.)

3. Agent loop multiplication

An agentic workflow makes several model calls per user request, in order to plan, retrieve, pick tools, evaluate, and respond. So one query can fan out into 5–15 internal calls, each carrying its own context, which is how consumption climbs 3–10× over simple RAG. Model it with the agent multiplier. Then control it by capping reasoning loops with a maximum step count, running sub-steps on economy-tier models, and caching intermediate results.

4. Retry and error rate

Users regenerate answers (5–10%), while systems retry for safety filters, failed validation, and rate limits (another 2–5%). Each retry duplicates the cost of that request, so a 10% retry rate quietly raises your effective per-request price by 10%. So budget a 5–15% buffer, and measure the real rate after launch. Better prompts cut regenerations, structured output formats cut parsing failures, and input pre-validation cuts filter triggers.

5. Model tier choice

This is defaulting to flagship-tier when workhorse or economy would do the job. Given a 15–30× spread, running everything on flagship is the most expensive architecture you can build. So estimate the share of queries that genuinely need flagship reasoning, which for most workloads is 5–15%. Then route by need. This, in fact, is the biggest lever in the whole framework, which is why it leads the optimization section below.
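The effect of routing can be estimated with relative prices, which keeps the arithmetic independent of any price list. In the Python sketch below, economy is the unit price, flagship sits at 20× (inside the 15–30× spread), and the 5× workhorse figure is purely an assumption for illustration. The 70/25/5 mix is the split suggested in the optimization section, and the sketch assumes every query uses the same number of tokens regardless of tier.

```python
# Relative price units: economy = 1. Workhorse and flagship values are assumptions.
RELATIVE_PRICE = {"economy": 1.0, "workhorse": 5.0, "flagship": 20.0}


def blended_price(mix: dict[str, float]) -> float:
    """Average relative price per query for a traffic mix that sums to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * RELATIVE_PRICE[tier] for tier, share in mix.items())


routed = blended_price({"economy": 0.70, "workhorse": 0.25, "flagship": 0.05})
all_flagship = blended_price({"flagship": 1.0})
all_workhorse = blended_price({"workhorse": 1.0})

print(f"routed mix:       {routed:.2f} units")
print(f"vs all-flagship:  {1 - routed / all_flagship:.0%} cheaper")
print(f"vs all-workhorse: {1 - routed / all_workhorse:.0%} cheaper")
```

The size of the saving depends heavily on the baseline you compare against, so state which one you are using when you quote it.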

6. Caching hit rate

This is the share of requests answered from cache rather than a paid call. At 0% every request is paid, while at 50% you roughly halve the eligible cost. So estimate it by query type: FAQ-style support hits 40–70%, code generation 10–25%, and creative work close to zero. Then control it with prompt-prefix caching for repeated context, and semantic caching for repeated queries.
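Estimating by query type amounts to a weighted average. The Python sketch below uses the midpoints of the hit-rate ranges above; the traffic mix is hypothetical and should be replaced with your own measured split.

```python
# Midpoints of the ranges quoted above.
HIT_RATE = {"faq_support": 0.55, "code_generation": 0.175, "creative": 0.0}

# Hypothetical share of requests per query type; must sum to 1.
traffic_mix = {"faq_support": 0.6, "code_generation": 0.3, "creative": 0.1}

blended_hit_rate = sum(traffic_mix[kind] * HIT_RATE[kind] for kind in traffic_mix)
print(f"blended cache hit rate: {blended_hit_rate:.1%}")
```

The blended figure is what goes into the cache hit rate variable of the formula. A product dominated by FAQ-style traffic lands far higher than one dominated by creative work.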

How to Cut Token Spend by 40–70%

Five techniques, ordered by impact. Every one is architectural — it depends on how you build, not on what anything costs — so all five survive the next price change. They’re a focused cut of our fuller AI Cost Reduction Playbook.

1. Tier routing. 30–60% off overall spend. Medium effort. Route roughly 70% of queries to economy-tier, 25% to workhorse, and keep flagship for the ~5% that earn it. Skip it only when every query is uniformly complex.

2. Semantic caching. 25–50% on repeated queries. Medium effort. Skip it for highly personalized or creative output.

3. Prompt compression and a system-prompt audit. 15–40% on input. Low effort to audit, higher to re-engineer. A prompt that’s been live for six months almost always carries removable weight.

4. Output constraints. 10–25% on output, and often the largest absolute saving, because output is the expensive half. Low effort: set max-output limits and use structured formats. Still, skip it where long-form generation is the point.

5. Hybrid search to shrink RAG context. 10–25% on RAG input. Medium effort. Better recall means less context stuffed into each request.

Applied together, these usually land 40–70% off total spend within 90 days. The low end is teams that adopt the first three; the high end is teams that do all five with proper instrumentation in place.

Three Deployment Scales, Modeled in Tokens

Dollar figures date the moment you publish them. So each scale below is expressed in monthly token volume and tier mix, the part that lasts. In practice, this is the most reliable way to estimate enterprise AI monthly cost without re-doing the work every time prices move: multiply the volume by your chosen tier’s current price from the calculator. The illustrative ranges are mid-2026 workhorse-tier and dated as such, so re-derive them with current numbers.

Small: internal copilot, 50–100 users. Basic RAG, economy-tier for most queries, workhorse for the hard ones. Roughly 150M input and 25M output tokens a month. So the cost is driven by RAG context on the input side, and the fastest win is prompt-prefix caching. Illustrative range, mid-2026: about $400–$900 a month.

Mid: customer-support agent, 500–1,500 users. Agentic RAG with three to five tools, workhorse-tier primary, economy-tier for sub-steps. Roughly 1.5B input and 250M output tokens a month. So here, the agent multiplier (1.5–3×) dominates the bill, and the biggest lever is tier routing: economy for retrieval and classification, and workhorse only for synthesis. Illustrative range, mid-2026: about $6,000–$14,000 a month.

Large: multi-domain platform, 5,000+ users. Multiple agents, multimodal context, fine-tuned specialists, cross-department routing. Roughly 15B input and 2.5B output tokens a month. Cost comes from agent loops plus multimodal context, and the lever that matters most is routing across providers, not just within one provider’s tiers. Illustrative range, mid-2026: about $50,000–$140,000 a month.

These volumes assume mature optimization. Before it, expect materially higher volumes: undoing a 40–70% cut means roughly 1.7–3.3× these figures.
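Converting the three token volumes into dollars is a single multiplication per tier. The Python sketch below uses the representative $3 / $15 workhorse pricing from the worked example as a stand-in; swap in current prices for your chosen tier. The names are illustrative.

```python
# Monthly volumes in millions of tokens: (input, output).
SCALES = {
    "small": (150, 25),
    "mid": (1_500, 250),
    "large": (15_000, 2_500),
}


def monthly_dollars(input_m, output_m, input_price_per_m, output_price_per_m):
    """Dollars per month for a volume given in millions of tokens."""
    return input_m * input_price_per_m + output_m * output_price_per_m


estimates = {
    name: monthly_dollars(inp, out, 3.0, 15.0) for name, (inp, out) in SCALES.items()
}
for name, dollars in estimates.items():
    print(f"{name:>5}: ${dollars:,.0f} a month")
```

At these prices each result falls inside the illustrative range quoted for its scale. A real deployment mixes tiers, so apply each tier's price to its own share of the volume.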

LLM cost composition by deployment scale — input, output, retry, and agent overhead as a percentage of monthly token spend

Seven Common Token Budget Mistakes

Every one of these is behavioral, which is why every one stays relevant no matter what models cost.

Counting only input tokens. Output is the expensive half (~5×). Always split the two.
No retry buffer. Add 10% to the baseline, then track the real rate and adjust.
Treating registered users as active. Only 20–40% usually are. Forecast on rolling MAU.
Ignoring context-window growth. Bills creep upward even at flat user counts. Audit prompts quarterly.
Flagship-tier as the default. With a 15–30× spread, this is the most expensive setup possible. Route by need.
No caching layer. The same questions get paid for thousands of times over. Caching pays back in weeks.
No post-launch monitoring. Review weekly for the first 90 days and alert on cost-per-request anomalies.
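A cost-per-request alert can start as a very small check. The Python sketch below flags a day whose cost per request exceeds the trailing average by a set factor. The 1.25 threshold and the sample figures are hypothetical, and a production setup would normally live in your observability tooling instead of a script.

```python
from statistics import mean


def cost_per_request(total_cost: float, request_count: int) -> float:
    """Average dollars per request for a period."""
    return total_cost / request_count


def is_anomalous(history: list[float], latest: float, threshold: float = 1.25) -> bool:
    """True when the latest value exceeds the trailing mean by the threshold factor."""
    return latest > mean(history) * threshold


# Hypothetical daily cost-per-request readings, in dollars.
trailing = [0.010, 0.011, 0.009, 0.010]
today = cost_per_request(total_cost=140.0, request_count=10_000)

if is_anomalous(trailing, today):
    print(f"alert: cost per request rose to ${today:.4f}")
```

A rising cost per request at flat traffic usually points at context growth or retries, not at user growth.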

Why LLM Prices Change — and How to Stay Current

Since this framework treats price as a moving input, it helps to know why it moves and how to keep your forecast honest.

The direction is down, and the cause is structural — not a passing promotion. Providers compete on price. Inference hardware keeps getting cheaper, and model architectures keep getting more efficient. New entrants keep undercutting the field. Together those forces took frontier prices down roughly 10× in two years for equivalent capability, and every new generation tends to push capability up while pushing price down.

For your forecast, that means the number you approve today will probably look conservative in six months, because the same workload will cost less. That’s a comfortable direction to be wrong in — but only if you re-forecast instead of setting it once and forgetting it.

Four habits keep you current:

Re-forecast quarterly. Re-run the calculator with current prices and swap your estimated usage variables for measured ones.
Monitor continuously. Watch cost-per-request and cost-per-active-user as live metrics, not month-end surprises. Observability tooling flags anomalies within days.
Re-check your tier mix on every major model release. A new economy-tier model can often absorb work that used to require workhorse-tier — capability up, cost down, no quality loss.
Watch the ratios, not just the prices. If a release breaks the usual output-to-input ratio or compresses the tier spread, that’s your signal to re-architect the routing.
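Watching the ratios can be automated with a check like the Python sketch below. The bands are assumptions chosen around the figures in this article (roughly 5–6× for output to input, 15–30× for tier spread), and the prices passed in are hypothetical. The tier spread is measured on output price against the cheapest model in the lineup.

```python
def ratio_signals(
    economy_output_price: float,
    flagship_input_price: float,
    flagship_output_price: float,
    ratio_band: tuple[float, float] = (4.0, 7.0),     # assumed tolerance around ~5-6x
    spread_band: tuple[float, float] = (15.0, 30.0),  # tier spread quoted in the article
) -> dict:
    """Compute the two durable ratios and flag values outside the expected bands."""
    output_to_input = flagship_output_price / flagship_input_price
    tier_spread = flagship_output_price / economy_output_price
    return {
        "output_to_input": output_to_input,
        "tier_spread": tier_spread,
        "ratio_ok": ratio_band[0] <= output_to_input <= ratio_band[1],
        "spread_ok": spread_band[0] <= tier_spread <= spread_band[1],
    }


# Hypothetical per-million prices after a new release.
signals = ratio_signals(
    economy_output_price=1.0,
    flagship_input_price=4.0,
    flagship_output_price=24.0,
)
print(signals)
```

If either flag turns false after a release, that is the cue to revisit the routing rules, since the assumptions behind them have shifted.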

This is exactly why the calculator and the article are split. The calculator is the living document you re-run. The article is the method that tells you how.

Putting Token Costs in Your Business Case

Bring a CFO three scenarios, not one number. The buffers are ratios, so they hold as prices move.

Conservative: 1.4× the expected figure. Absorbs faster user growth and slower-than-planned optimization.
Expected: the calculator’s output at current prices. This is your approval line.
Optimistic: 0.7× the expected figure. Reflects mature optimization plus the downward price trend.

Set each against the manual-workflow cost the AI is replacing or augmenting, state the payback period, and build the controls in from day one: budget caps, weekly variance alerts, quarterly re-forecasts.
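The three scenarios and a simple payback figure can be sketched in Python as follows. The 1.4× and 0.7× factors are the ones given above, and the $3,192 expected figure is the worked example's result. The payback definition (upfront cost divided by monthly savings) is an assumption, and the $60,000 build cost and $15,000 manual cost are hypothetical.

```python
SCENARIO_FACTORS = {"conservative": 1.4, "expected": 1.0, "optimistic": 0.7}


def scenarios(expected_monthly_cost: float) -> dict[str, float]:
    """Monthly cost under each scenario."""
    return {name: expected_monthly_cost * f for name, f in SCENARIO_FACTORS.items()}


def payback_months(upfront_cost: float, manual_monthly_cost: float, ai_monthly_cost: float):
    """Months to recover the upfront cost; None if the AI does not save money."""
    monthly_saving = manual_monthly_cost - ai_monthly_cost
    if monthly_saving <= 0:
        return None
    return upfront_cost / monthly_saving


plan = scenarios(3_192.0)
for name, monthly in plan.items():
    months = payback_months(60_000.0, 15_000.0, monthly)
    print(f"{name:>12}: ${monthly:,.0f}/month, payback {months:.1f} months")
```

Presenting payback for all three scenarios shows the CFO how sensitive the business case is to the forecast, not just what the midpoint looks like.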

Download the AI Token Cost Forecast Template (Excel) — three pre-built scenarios with editable price inputs you update each quarter.

Frequently Asked Questions

How do you calculate the monthly cost of an LLM application?

Here is how to calculate LLM cost in practice: multiply monthly active users by active days per month and sessions per active day, then by average tokens per session (input and output separately), then by the current per-token price for each. Adjust for your retry rate and cache hit rate, then multiply by the agent multiplier: 1× for simple RAG, up to 10× for fully agentic systems. The formula and a worked example are above.

Why does this guide avoid a fixed price table?

Because LLM prices change almost monthly and fell roughly 10× in two years for equivalent capability. A fixed table is stale within weeks. The durable parts — the formula, the ~5× output premium, the ~15–30× tier spread, the optimization levers — let you forecast with any current price. Look up today’s numbers in the calculator or on the provider pages.

Why are output tokens more expensive than input tokens?

Output is generated one token at a time, each depending on the last, while input is processed in parallel. So output runs about 5× the price of input, closer to 6× on current flagships, across providers and generations. Cheaper hardware lowers the absolute price, but not the ratio.

How much does Agentic RAG cost compared to simple RAG?

Agentic RAG consumes 3–10× the tokens of simple RAG, because each user query triggers several internal model calls to plan, retrieve, use tools, and evaluate. Self-correcting RAG sits near the low end, while full reasoning-loop agents sit near the high end.

Should I use a flagship or an economy-tier model?

Both, for different queries. Route most traffic, around 70%, to economy-tier, some to workhorse, and reserve flagship-tier for the 5–15% of queries that genuinely need deep reasoning. With a 15–30× tier spread, routing is the single biggest cost lever you have.

How much can caching reduce LLM costs?

Prompt-prefix caching discounts repeated input by up to ~90% on the cached portion. Semantic caching serves full stored responses for repeated queries and typically cuts 25–50% on eligible traffic such as FAQ-style support. Creative or highly personalized workloads benefit far less.

Is it cheaper to self-host open-source models?

Only at sustained high volume. Self-hosting turns competitive somewhere in the range of tens to hundreds of millions of tokens a month, and the crossover depends heavily on GPU utilization, since an under-loaded GPU can cost more per token than a hosted API. Below that range, and for teams without MLOps capacity, hosted APIs usually win once you count engineering time.
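The utilization effect is simple to model. In the Python sketch below every number is hypothetical: a $2,500 monthly GPU cost, a capacity of 1,000 million tokens a month at full load, and a blended hosted-API price of $5 per million tokens. Engineering time is left out, which flatters self-hosting.

```python
def self_host_price_per_m(monthly_gpu_cost: float, capacity_m_tokens: float,
                          utilization: float) -> float:
    """Effective dollars per million tokens on self-hosted hardware."""
    return monthly_gpu_cost / (capacity_m_tokens * utilization)


def breakeven_m_tokens(monthly_fixed_cost: float, api_price_per_m: float) -> float:
    """Monthly volume (millions of tokens) where fixed cost equals the API bill."""
    return monthly_fixed_cost / api_price_per_m


GPU_COST, CAPACITY_M, API_PRICE = 2_500.0, 1_000.0, 5.0  # all hypothetical

for utilization in (0.10, 0.50, 0.80):
    price = self_host_price_per_m(GPU_COST, CAPACITY_M, utilization)
    verdict = "cheaper" if price < API_PRICE else "more expensive"
    print(f"{utilization:.0%} utilized: ${price:.2f}/M tokens, {verdict} than the API")

print(f"break-even volume: {breakeven_m_tokens(GPU_COST, API_PRICE):,.0f}M tokens a month")
```

The fixed GPU cost is spread over whatever volume actually flows through it, so the same hardware can be either a bargain or a loss depending on load.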

What’s a realistic token budget for a small AI pilot?

For a 4–8 week pilot serving 50–200 users on simple RAG, expect a modest volume — on the order of low hundreds of millions of tokens total — and add a 30% buffer for spikes. Convert to dollars with current economy- or workhorse-tier pricing. If the pilot is fully agentic from day one, multiply the volume by 3–5×.

Summary: What to Remember

Prices change; the model doesn’t. Forecast with durable ratios and treat the current price as a single input.
Six variables drive the bill: active users, session frequency, input and output tokens per session, retry rate, cache hit rate, agent multiplier.
Three ratios hold steady: output ~5× input, flagship ~15–30× economy, agentic ~3–10× simple RAG.
Tier routing is the biggest lever, so send most traffic to economy-tier, and flagship only where it earns its keep.
40–70% savings inside 90 days come from routing, semantic caching, prompt compression, output constraints, and hybrid search.
Re-forecast quarterly and monitor continuously. The price trend runs downward, so disciplined teams keep beating their own budgets.

How We Can Help

At SumatoSoft, we build AI systems where cost is something you design, not something you discover on the invoice. The framework here is Pillar 2 — Financial Governance — of our Agentic Development Lifecycle, the methodology behind every AI engagement we run.

What we deliver:

AI Readiness Assessment — a token-burn forecast, ROI model, and architecture review for your specific use case.
AI software development — custom systems with cost governance built into the architecture, not bolted on after.
RAG development and AI agent development — production builds with token-cost discipline from day one.
AI PoC development — 4–8 week pilots with measurable outcomes and defensible cost models.

Why SumatoSoft: 14+ years on the market, 350+ delivered custom solutions, 25+ countries. ISO 27001 certified, with a bug-fix guarantee agreed upfront. 70% senior engineers, and AI practice leads who’ve shipped production systems across healthcare, fintech, logistics, and manufacturing.

Start in three steps

Book a 30-minute token forecast session. Bring your use case, your current bill if you have one, and your projected user volume.
Get a defensible monthly forecast — three scenarios, built on durable ratios rather than a price snapshot.
Get a 90-day optimization roadmap with the specific techniques that will cut your cost 40–70%, ordered by impact and effort.

No qualification form to fill out first. Bring your numbers; we’ll bring 14 years of building this for other people.

Schedule your token forecast session →

Part of our AI Engineering Leadership cluster. For the methodology behind cost-governed AI, read What Is ADLC (Agentic Development Lifecycle). The full optimization playbook lives in The AI Cost Reduction Playbook. And the technical patterns behind retrieval-augmented agents are in our Agentic RAG Enterprise Implementation Guide.

This framework is built to stay accurate as model prices change. For current per-token pricing, use the calculator above or the provider pages: OpenAI, Anthropic, Google Cloud Vertex AI, AWS Bedrock. We review the durable ratios in this article twice a year.