AI Token Cost Calculation: A Pricing-Independent Framework for Forecasting LLM Spend (2026)

Forecast LLM spend before you build

TL;DR

  • Model names and prices change almost weekly. Your cost model should not. This framework forecasts LLM spend with durable ratios, not a price table that goes stale within a month.
  • The formula has six variables: active users, sessions, input and output tokens per session, retry rate, cache hit rate, and agent multiplier.
  • Three ratios stay stable as prices fall: output costs ~5× input (rising toward 6× on current flagships), flagship models cost ~15–30× economy models, and agentic workflows consume 3–10× the tokens of simple RAG.
  • The highest-leverage decision is tier routing — send most traffic to economy-tier models, reserve flagship-tier for the few percent that need it.
  • Plug today’s prices into the calculator above for a live number. The logic below tells you what to plug in and why.

EXCEL TEMPLATE: AI token cost calculator — estimate monthly LLM spend by users, tokens, and model tier

This AI token cost calculator gives you a live number with current pricing. The article below explains the formula behind it, the four pricing tiers that outlive model names, and the five techniques that cut your bill 40–70%.


A SaaS company we worked with launched an internal copilot for 800 employees in early 2025. They budgeted $12,000 a month. By month three the bill was $34,000.

Nothing dramatic had happened. The vendor hadn’t raised prices. Headcount on the tool grew 14%, not 180%. The team had just built their estimate on a single assumption — one user message, one model reply — and production doesn’t work that way. Real conversations run several turns. Users hit “regenerate.” The retrieval step kept stuffing more context into every prompt than anyone had budgeted for.

We helped them rebuild the forecast, and then something more useful happened over the following year. Three new model generations shipped. Prices dropped, more than once. And each time, the team rebuilt their estimate from scratch — new model, new spreadsheet, new number. Eventually we noticed the obvious thing: the prices were the only part that ever changed. The shape of the bill — how input relates to output, how a cheap model relates to an expensive one, how an agent multiplies calls — never moved at all.

That is what this article is about. We at SumatoSoft have built AI systems for clients across 25+ countries, and the framework here — along with the AI token cost calculator at the top of the page — is meant to survive the next model release, and the one after that. It also happens to be Pillar 2 — Financial Governance — of our Agentic Development Lifecycle.

Why This Article Doesn’t Lead With a Price Table

Most LLM cost guides open with a table of model prices. We don’t, for one plain reason: any such table is wrong within weeks.

For equivalent capability, frontier prices fell roughly 10× between 2023 and 2025 — a GPT-4-class model that cost about $60 per million output tokens at launch has equivalents today in the low single digits. (That’s the price of equivalent capability falling, not the flagship price-point itself dropping tenfold; the newest flagship still sits at a premium.) New generations land every few months, each one resetting the lineup. Write “use Model X at $Y” and you’ve published a fact with a one-quarter shelf life.

What doesn’t expire is the structure underneath:

  • The formula that turns usage into cost.
  • The ratios between input and output, and between cheap and expensive tiers.
  • The drivers that inflate consumption.
  • The levers that bring it back down.

Learn those once and you can price any lineup — today’s, next quarter’s, next year’s. The current price becomes a single number you look up at the moment of calculation, not the foundation of what you understand. That’s also why the calculator is the right home for live pricing: it holds today’s numbers and gets refreshed as the market moves. This article holds the part that doesn’t.

[Figure: Three ratios]

The Anatomy of an LLM Bill

Before you forecast, know what you’re paying for. This section has almost no perishable numbers in it — just one ratio that matters.

Providers bill per token. One token is roughly four characters, or about 0.75 words in English; a 500-word email runs about 650 tokens. Every provider meters this way and that isn’t going to change.
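As a rough illustration of that rule of thumb, here is a minimal Python sketch. The function name `estimate_tokens` and its default are ours for illustration, not any provider's API; real counts come from the provider's tokenizer and vary with language and content.

```python
def estimate_tokens(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough English-only estimate based on ~0.75 words per token (assumption)."""
    return round(word_count / words_per_token)

# A 500-word email, as in the text above.
print(estimate_tokens(500))
```

The estimate for 500 words comes out in the 650–700 range, in line with the figure above. Treat it as a planning number only and replace it with measured counts once you have telemetry.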

Input and output are priced separately, and the distinction is the whole game. Input is everything you send: the system prompt, retrieved context in a RAG setup (retrieval-augmented generation — where the model is fed relevant documents at query time), the conversation history, the user’s message. Output is what the model writes back. They behave nothing alike on cost.

Here is the one durable number to carry with you: output tokens cost roughly 5× input, and on current flagship models the ratio is closer to 6×. This holds across providers and across model generations, and it holds for a structural reason. Input tokens are processed in parallel — the model reads them more or less at once. Output tokens are generated one at a time, each one depending on the last. Faster hardware drives the absolute price down every year, but the asymmetry between “read in parallel” and “write sequentially” stays, so the ratio stays with it.

The practical consequence: a forecast that counts only input — the single most common mistake we see — misses the expensive half of the bill entirely.
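To see how much an input-only forecast leaves out, a small sketch helps. It assumes the ~5× output premium and a hypothetical session of 2,500 input tokens and 400 output tokens; `output_share` is an illustrative helper, not part of any library.

```python
def output_share(input_tokens: int, output_tokens: int,
                 output_to_input_ratio: float = 5.0) -> float:
    """Fraction of per-session cost attributable to output tokens."""
    input_cost = input_tokens * 1.0                      # input priced at 1 unit per token
    output_cost = output_tokens * output_to_input_ratio  # output priced at `ratio` units
    return output_cost / (input_cost + output_cost)

share = output_share(input_tokens=2_500, output_tokens=400)
print(f"{share:.0%} of the session cost is output")
```

In this hypothetical session, output is under 14% of the tokens but well over a third of the cost. The exact share depends on your own token mix, which is why the two sides are modeled separately.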

The Forecasting Formula

There are two versions. One for a quick check, one for a number you’d actually defend to a CFO.

[Figure: Forecasting formula]

The core equation

Monthly Cost = Users × Sessions_per_User × Avg_Tokens_per_Session × Price_per_Token

Fine for thirty seconds of arithmetic. Not fine for a budget.

The production formula

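Monthly Cost = (
    (MAU × Active_days × Sessions_per_active_day × Avg_Input_Tokens × Input_Price)
  + (MAU × Active_days × Sessions_per_active_day × Avg_Output_Tokens × Output_Price)
  + (Retry_rate × Base_cost)
  − (Cache_hit_rate × Cacheable_cost)
) × Agent_Multiplier

Base_cost is the sum of the input and output lines. Cacheable_cost is the part of it that a cache can actually serve.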

Notice where price shows up: two inputs, Input_Price and Output_Price. Everything else is about your usage and your architecture — which is where forecasting accuracy actually comes from. When prices change, you swap two numbers and the structure holds.
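The production formula can be sketched as a single Python function. This is one possible reading of it, not a definitive implementation: sessions are counted as monthly active users × active days × sessions per active day (the way the worked example below counts them), prices are per million tokens, and `cacheable_share` is a parameter we introduce to express how much of the base cost the cache can touch.

```python
def monthly_cost(mau: int, active_days: int, sessions_per_day: float,
                 input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float,
                 retry_rate: float = 0.0, cache_hit_rate: float = 0.0,
                 cacheable_share: float = 1.0,
                 agent_multiplier: float = 1.0) -> float:
    """Monthly LLM spend in dollars; prices are dollars per million tokens."""
    sessions = mau * active_days * sessions_per_day
    input_cost = sessions * input_tokens / 1e6 * input_price_per_m
    output_cost = sessions * output_tokens / 1e6 * output_price_per_m
    base_cost = input_cost + output_cost
    retry_cost = retry_rate * base_cost
    # cacheable_share = 1.0 models a full-response (semantic) cache.
    cache_saving = cache_hit_rate * cacheable_share * base_cost
    return (base_cost + retry_cost - cache_saving) * agent_multiplier

# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
cost = monthly_cost(mau=1_000, active_days=20, sessions_per_day=2,
                    input_tokens=1_000, output_tokens=200,
                    input_price_per_m=3.0, output_price_per_m=15.0,
                    retry_rate=0.10, cache_hit_rate=0.25, agent_multiplier=2.0)
print(f"${cost:,.2f} per month")
```

Only `input_price_per_m` and `output_price_per_m` carry pricing. Re-running the same call with new prices is the whole update procedure when a provider changes its rates.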

The six variables

  • Monthly Active Users (MAU). Active, not registered. Use a 30-day rolling count. No telemetry yet? Estimate 20–40% of registered users as active.
  • Sessions per active day. Pull this from a comparable product. An internal copilot runs 3–8 sessions per active day; a customer-support agent, 10–25 per agent; a field-service copilot, 2–4 per technician.
  • Average tokens per session — input and output, separately. Input is system prompt plus retrieved context plus history plus the user’s message. Output is the response. Model them apart, because of that ~5× gap.
  • Retry rate. 5–15% in mature production. Covers user regenerations plus system retries for safety filters, validation failures, and rate limits.
  • Cache hit rate. The share of requests served from cache. Two different mechanisms live here; see the note below, because they don’t save money the same way.
  • Agent multiplier. Simple RAG is 1×. Self-correcting RAG, about 3×. Light agents with two or three tools, 3–5×. Full agentic systems with planning and reasoning loops, 5–10×. Our Agentic RAG implementation guide goes deep on this.

A note on the two cache types, because mixing them up is how forecasts drift. Prompt-prefix caching is native to the major APIs. It discounts repeated input only — up to about 90% off the cached portion — and never touches output. Semantic caching is a layer you add. It returns a full stored response — input and output together — for queries close enough to ones already answered. So in the formula, prompt-prefix caching reduces only the input part of the cost, while semantic caching reduces the whole eligible response. The two are not interchangeable.
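The difference between the two cache types is easier to see side by side. The sketch below is illustrative only: the function names are ours, the 90% discount is the upper bound quoted above, and the dollar figures are a hypothetical base of $1,620 input and $1,296 output. It also ignores any cost of running the semantic cache itself.

```python
def apply_prefix_cache(input_cost: float, output_cost: float,
                       cached_input_share: float,
                       discount: float = 0.90) -> float:
    """Prompt-prefix caching: discounts the cached part of input, never output."""
    return input_cost * (1 - cached_input_share * discount) + output_cost

def apply_semantic_cache(input_cost: float, output_cost: float,
                         hit_rate: float) -> float:
    """Semantic caching: a hit skips the whole call, input and output together."""
    return (input_cost + output_cost) * (1 - hit_rate)

input_cost, output_cost = 1_620.0, 1_296.0  # hypothetical monthly base
print(round(apply_prefix_cache(input_cost, output_cost, cached_input_share=0.5), 2))
print(round(apply_semantic_cache(input_cost, output_cost, hit_rate=0.35), 2))
```

Feeding a prompt-prefix hit rate into the semantic formula overstates the saving, because it wrongly discounts output as well.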

Worked example — an internal HR/IT copilot

We’ll use representative workhorse-tier pricing here: $3 input, $15 output per million tokens, a typical mid-tier rate in mid-2026. When you run this for your own project, drop in the current price for whichever tier you’ve chosen — the method is identical and the prices are interchangeable.

The setup:

  • A SaaS firm, 5,000 employees. Internal HR/IT copilot, used by 60% of staff.
  • MAU 3,000. Active days 18 a month. Four sessions per active day.
  • Input 2,500 tokens per session (system prompt 300 + RAG context 1,800 + user message 400). Output 400.
  • Retry rate 8%. Semantic cache hit rate 35% — a full-response cache, so it applies to the whole base. Agent multiplier 1.5× (light agentic RAG).

Step 1 — sessions per month: 3,000 × 18 × 4 = 216,000

Step 2 — input cost: 216,000 × 2,500 = 540M input tokens → 540 × $3 = $1,620

Step 3 — output cost: 216,000 × 400 = 86.4M output tokens → 86.4 × $15 = $1,296

Step 4 — base cost: $1,620 + $1,296 = $2,916

Step 5 — retries and cache: retries add 8% (+$233); the semantic cache removes 35% of the base (−$1,021). Adjusted base: $2,128

Step 6 — agent multiplier: $2,128 × 1.5 = $3,192 a month

That’s a number you can put in front of a CFO. Run it again with any other tier’s prices and only the two price inputs move; the steps are the same.
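The six steps above can be reproduced in a few lines of Python, which is handy when you want to swap in your own inputs. Variable names are ours. The script carries full precision through every step, so its final figure lands about a dollar above the hand-rounded $3,192.

```python
# Inputs from the worked example above.
mau, active_days, sessions_per_day = 3_000, 18, 4
input_tokens, output_tokens = 2_500, 400
input_price, output_price = 3.0, 15.0      # dollars per million tokens
retry_rate, cache_hit_rate, agent_multiplier = 0.08, 0.35, 1.5

sessions = mau * active_days * sessions_per_day            # Step 1
input_cost = sessions * input_tokens / 1e6 * input_price   # Step 2
output_cost = sessions * output_tokens / 1e6 * output_price  # Step 3
base = input_cost + output_cost                            # Step 4
adjusted = base + retry_rate * base - cache_hit_rate * base  # Step 5
monthly = adjusted * agent_multiplier                      # Step 6

for label, value in [("sessions", sessions), ("input", input_cost),
                     ("output", output_cost), ("base", base),
                     ("adjusted", adjusted), ("monthly", monthly)]:
    print(f"{label:>9}: {value:,.2f}")
```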

One honest caveat. Applying the agent multiplier at the very end is a simplification. It’s a fair one here, because this is a light-agentic example at 1.5×. But if you’re modeling a heavy agentic system at 5–10×, don’t multiply at the end. The loops happen per call, so caching and retries interact with each loop — you should model it call by call. Name the simplification, and know where it stops being safe.
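For heavier agents, a call-by-call model might look like the sketch below. Everything in it is an assumption made for illustration: the `Call` structure, the three example calls and their token counts, and the choice to apply retries and prompt-prefix caching per call. Prices reuse the representative $3/$15 per million tokens.

```python
from dataclasses import dataclass

@dataclass
class Call:
    name: str
    input_tokens: int
    output_tokens: int
    input_price: float               # dollars per million tokens
    output_price: float              # dollars per million tokens
    retry_rate: float = 0.0
    cached_input_share: float = 0.0  # share of input billed at the cache discount

def call_cost(c: Call, prefix_discount: float = 0.90) -> float:
    """Cost of one internal model call, with its own cache and retry behavior."""
    input_cost = (c.input_tokens / 1e6 * c.input_price
                  * (1 - c.cached_input_share * prefix_discount))
    output_cost = c.output_tokens / 1e6 * c.output_price
    return (input_cost + output_cost) * (1 + c.retry_rate)

def request_cost(calls: list[Call]) -> float:
    """Cost of one user request that fans out into several calls."""
    return sum(call_cost(c) for c in calls)

# Hypothetical three-step agent loop for a single user request.
calls = [
    Call("plan", 1_500, 200, 3.0, 15.0),
    Call("retrieve", 2_500, 100, 3.0, 15.0, cached_input_share=0.5),
    Call("respond", 3_000, 400, 3.0, 15.0, retry_rate=0.08),
]
print(f"${request_cost(calls):.4f} per request")
```

Multiply the per-request figure by monthly request volume to get the bill. The point of the structure is that each loop step can carry its own tier, cache share, and retry rate instead of one blended multiplier.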

The Four Pricing Tiers (a Framework That Outlives Model Names)

Rather than a table of names that’ll be obsolete by autumn, here’s the durable structure. Every provider’s lineup, in every generation, sorts into four tiers. The names rotate. The tiers don’t.

[Table: Pricing tiers]

The durable insight is the spread, not any single price: cheapest tier to most expensive runs roughly 15–30×, and it has held or widened over the years — even as every absolute price dropped, new nano-tier models kept pulling the floor down. That spread is what makes tier routing the single highest-leverage cost decision available to you.

For today’s specific models and their prices, use the calculator above or go straight to the source: OpenAI, Anthropic, Google Cloud Vertex AI, AWS Bedrock. To slot a model into the table, compare its output price to the cheapest model in the same provider’s lineup.

The Six Factors That Drive Token Spend

Same shape for each: what it is, why it inflates the bill, how to model it, how to control it. All ratios and behavior, no absolute prices — so none of it goes stale.

1. User volume and frequency growth

Consumption scales linearly with active users and session frequency, and new AI products routinely see MAU grow 2–4× in the first six months. The trap is forecasting on the launch-day user count, then watching the bill double by month four. Model it with a 12-month growth curve instead of a static number. Control it by tiering access — not everyone needs unlimited use on day one.
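A 12-month growth curve can be sketched in a few lines. The shape used here is an assumption, not a measured pattern: steady compound growth up to a chosen multiple at month six, then flat. Replace it with your own adoption data as soon as you have any.

```python
def mau_curve(launch_mau: int, six_month_multiple: float = 3.0,
              months: int = 12) -> list[int]:
    """Projected MAU per month: compound growth for six months, then flat."""
    monthly_growth = six_month_multiple ** (1 / 6)
    curve = []
    for month in range(months):
        factor = monthly_growth ** min(month, 6)  # growth stops after month six
        curve.append(round(launch_mau * factor))
    return curve

# Hypothetical launch with 1,000 active users and 3x growth by month six.
for month, mau in enumerate(mau_curve(1_000)):
    print(f"month {month:2d}: {mau:,} MAU")
```

Feeding each month's MAU into the cost formula gives a budget line per month instead of a single launch-day figure.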

2. Context window size

Every token in context — system prompt, RAG chunks, history, few-shot examples — is billed as input on every request. Teams load context “just in case,” and prompts accrete: a system prompt that started lean picks up 30–50% bloat over six months as edge cases get patched in. Track average input tokens per request as a KPI and compare it to the minimum that holds quality. Control it with prompt compression, selective retrieval, and RAG over long context — retrieve the 2K tokens that matter instead of stuffing a 100K window. (Our RAG development work lives here.)

3. Agent loop multiplication

An agentic workflow makes several model calls per user request — to plan, retrieve, pick tools, evaluate, respond. One query can fan out into 5–15 internal calls, each carrying its own context, which is how consumption climbs 3–10× over simple RAG. Model it with the agent multiplier. Control it by capping reasoning loops with a maximum step count, running sub-steps on economy-tier models, and caching intermediate results.

[Figure: Token cascade]

4. Retry and error rate

Users regenerate answers (5–10%); systems retry for safety filters, failed validation, and rate limits (another 2–5%). Each retry duplicates the cost of that request, so a 10% retry rate quietly raises your effective per-request price by 10%. Budget a 5–15% buffer and measure the real rate after launch. Better prompts cut regenerations; structured output formats cut parsing failures; input pre-validation cuts filter triggers.

5. Model tier choice

The failure mode here is defaulting to flagship-tier when workhorse or economy would do the job. Given a 15–30× spread, running everything on flagship is the most expensive architecture you can build. Estimate the share of queries that genuinely need flagship reasoning; for most workloads it’s 5–15%. Then route by need. This is the biggest lever in the whole framework, which is why it leads the optimization section below.
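A quick way to size the routing lever is to compare a blended price against a single-tier baseline. The sketch below works in relative price units and rests on stated assumptions: economy is 1, flagship is 20× (inside the 15–30× spread), workhorse at 5× is a hypothetical midpoint, and every query uses the same number of tokens regardless of tier. The 70/25/5 mix is the split suggested in the optimization section.

```python
# Relative price per token, with economy = 1 (assumed values, see lead-in).
RELATIVE_PRICE = {"economy": 1.0, "workhorse": 5.0, "flagship": 20.0}

def blended_price(mix: dict[str, float]) -> float:
    """Average relative price per token for a given traffic split."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "traffic shares must sum to 1"
    return sum(share * RELATIVE_PRICE[tier] for tier, share in mix.items())

routed = blended_price({"economy": 0.70, "workhorse": 0.25, "flagship": 0.05})
for baseline in ("workhorse", "flagship"):
    saving = 1 - routed / RELATIVE_PRICE[baseline]
    print(f"vs all-{baseline}: {saving:.0%} cheaper")
```

The saving depends heavily on what you were running before, so pick the baseline that matches your current setup rather than the most flattering one.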

6. Caching hit rate

The share of requests answered from cache rather than a paid call. At 0% every request is paid; at 50% you roughly halve the eligible cost. Estimate it by query type — FAQ-style support hits 40–70%, code generation 10–25%, creative work close to zero. Control it with prompt-prefix caching for repeated context and semantic caching for repeated queries.
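Estimating by query type amounts to a weighted average. In the sketch below the traffic mix is hypothetical, and the per-type hit rates are picked from inside the ranges above; `blended_hit_rate` is an illustrative helper.

```python
def blended_hit_rate(mix: dict[str, tuple[float, float]]) -> float:
    """mix maps query type -> (share of traffic, expected cache hit rate)."""
    return sum(share * hit_rate for share, hit_rate in mix.values())

# Hypothetical traffic mix for a support-heavy product.
mix = {
    "faq":      (0.60, 0.50),
    "code":     (0.30, 0.15),
    "creative": (0.10, 0.00),
}
print(f"overall cache hit rate: {blended_hit_rate(mix):.1%}")
```

The blended figure is what goes into the cache term of the forecasting formula. Measure the real rate after launch and replace the estimate.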

How to Cut Token Spend by 40–70%

Five techniques, ordered by leverage. Every one is architectural — it depends on how you build, not on what anything costs — so all five survive the next price change. They’re a focused cut of our fuller AI Cost Reduction Playbook.

1. Tier routing. 30–60% off overall spend. Medium effort. Route roughly 70% of queries to economy-tier, 25% to workhorse, and keep flagship for the ~5% that earn it. Skip it only when every query is uniformly complex.

2. Semantic caching. 25–50% on repeated queries. Medium effort. Skip it for highly personalized or creative output.

3. Prompt compression and a system-prompt audit. 15–40% on input. Low effort to audit, higher to re-engineer. A prompt that’s been live six months almost always carries removable weight.

4. Output constraints. 10–25% on output — and often the largest absolute saving, because output is the expensive half. Low effort: set max-output limits and use structured formats. Skip it where long-form generation is the point.

5. Hybrid search to shrink RAG context. 10–25% on RAG input. Medium effort. Better recall means less context stuffed into each request.

Applied together, these usually land 40–70% off total spend within 90 days. The low end is teams that adopt the first three; the high end is teams that do all five with proper instrumentation in place.

[Figure: Optimization]

Three Deployment Scales, Modeled in Tokens

Dollar figures date the moment you publish them, so each scale below is expressed in monthly token volume and tier mix — the part that lasts. This is the most reliable way to estimate enterprise AI monthly cost without re-doing the work every time prices move: multiply the volume by your chosen tier’s current price from the calculator. The illustrative ranges are mid-2026 workhorse-tier and dated as such; re-derive them with current numbers.

Small — internal copilot, 50–100 users. Basic RAG, economy-tier for most queries, workhorse for the hard ones. Roughly 150M input and 25M output tokens a month. The cost is driven by RAG context on the input side; the fastest win is prompt-prefix caching. Illustrative range, mid-2026: about $400–$900 a month.

Mid — customer-support agent, 500–1,500 users. Agentic RAG with three to five tools, workhorse-tier primary, economy-tier for sub-steps. Roughly 1.5B input and 250M output tokens a month. Here the agent multiplier (1.5–3×) dominates the bill, and the biggest lever is tier routing — economy for retrieval and classification, workhorse only for synthesis. Illustrative range, mid-2026: about $6,000–$14,000 a month.

Large — multi-domain platform, 5,000+ users. Multiple agents, multimodal context, fine-tuned specialists, cross-department routing. Roughly 15B input and 2.5B output tokens a month. Cost comes from agent loops plus multimodal context, and the lever that matters most is routing across providers, not just within one provider’s tiers. Illustrative range, mid-2026: about $50,000–$140,000 a month.

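These volumes assume mature optimization. Before it, expect considerably more: if optimization removes 40–70% of spend, the unoptimized baseline is roughly 1.7–3.3× these figures.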
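Converting the three token volumes into dollars is a one-line calculation per scale. The sketch below uses the representative workhorse rate of $3 input and $15 output per million tokens from the worked example; the dictionary and function names are ours.

```python
# Monthly token volumes from the three scales above: (input, output).
SCALES = {
    "small": (150e6, 25e6),
    "mid":   (1.5e9, 250e6),
    "large": (15e9, 2.5e9),
}

def monthly_dollars(input_tokens: float, output_tokens: float,
                    input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of a monthly token volume at per-million-token prices."""
    return (input_tokens / 1e6 * input_price_per_m
            + output_tokens / 1e6 * output_price_per_m)

for name, (tokens_in, tokens_out) in SCALES.items():
    cost = monthly_dollars(tokens_in, tokens_out, 3.0, 15.0)
    print(f"{name:>5}: ${cost:,.0f} per month")
```

At that rate the three scales come to $825, $8,250, and $82,500 a month, each inside its illustrative range. Swap in current prices to refresh the numbers.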

[Figure: LLM cost composition by deployment scale: input, output, retry, and agent overhead as a percentage of monthly token spend]

Seven Common Token Budget Mistakes

Every one of these is behavioral, which is why every one stays relevant no matter what models cost.

  1. Counting only input tokens. Output is the expensive half (~5×). Always split the two.
  2. No retry buffer. Add 10% to the baseline, then track the real rate and adjust.
  3. Treating registered users as active. Only 20–40% usually are. Forecast on rolling MAU.
  4. Ignoring context-window growth. Bills creep upward even at flat user counts. Audit prompts quarterly.
  5. Flagship-tier as the default. With a 15–30× spread, this is the most expensive setup possible. Route by need.
  6. No caching layer. The same questions get paid for thousands of times over. Caching pays back in weeks.
  7. No post-launch monitoring. Review weekly for the first 90 days and alert on cost-per-request anomalies.
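For the monitoring item, a cost-per-request alert can start very simply. The sketch below flags a day whose cost per request exceeds the trailing average by more than a tolerance. The 50% tolerance and the sample figures are hypothetical, and a production setup would normally live in your observability tooling rather than a script.

```python
from statistics import mean

def is_cost_anomaly(history: list[float], today: float,
                    tolerance: float = 0.5) -> bool:
    """True if today's cost per request exceeds the trailing mean by > tolerance."""
    baseline = mean(history)
    return today > baseline * (1 + tolerance)

# Hypothetical daily cost-per-request figures, in dollars.
trailing = [0.010, 0.011, 0.009]
print(is_cost_anomaly(trailing, today=0.020))
print(is_cost_anomaly(trailing, today=0.012))
```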

Why LLM Prices Change — and How to Stay Current

Since this framework treats price as a moving input, it helps to know why it moves and how to keep your forecast honest.

The direction is down, and the cause is structural — not a passing promotion. Providers compete on price. Inference hardware keeps getting cheaper, and model architectures keep getting more efficient. New entrants keep undercutting the field. Together those forces took frontier prices down roughly 10× in two years for equivalent capability, and every new generation tends to push capability up while pushing price down.

[Figure: Price decline]

For your forecast, that means the number you approve today will probably look conservative in six months, because the same workload will cost less. That’s a comfortable direction to be wrong in — but only if you re-forecast instead of setting it once and forgetting it.

Four habits keep you current:

  • Re-forecast quarterly. Re-run the calculator with current prices and swap your estimated usage variables for measured ones.
  • Monitor continuously. Watch cost-per-request and cost-per-active-user as live metrics, not month-end surprises. Observability tooling flags anomalies within days.
  • Re-check your tier mix on every major model release. A new economy-tier model can often absorb work that used to require workhorse-tier — capability up, cost down, no quality loss.
  • Watch the ratios, not just the prices. If a release breaks the usual output-to-input ratio or compresses the tier spread, that’s your signal to re-architect the routing.

This is exactly why the calculator and the article are split. The calculator is the living document you re-run. The article is the method that tells you how.

Putting Token Costs in Your Business Case

Bring a CFO three scenarios, not one number. The buffers are ratios, so they hold as prices move.

  • Conservative — 1.4× the expected figure. Absorbs faster user growth and slower-than-planned optimization.
  • Expected — the calculator’s output at current prices. This is your approval line.
  • Optimistic — 0.7× the expected figure. Reflects mature optimization plus the downward price trend.
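The three scenarios are plain multiples of the expected figure, so they can be generated from one number. A minimal sketch, using the $3,192 from the worked example as the expected case; the function name is ours.

```python
def scenarios(expected: float) -> dict[str, float]:
    """Conservative, expected, and optimistic monthly budgets."""
    return {
        "conservative": expected * 1.4,  # faster growth, slower optimization
        "expected": expected,            # calculator output at current prices
        "optimistic": expected * 0.7,    # mature optimization, falling prices
    }

for name, value in scenarios(3_192).items():
    print(f"{name:>12}: ${value:,.0f} per month")
```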

Set each against the manual-workflow cost the AI is replacing or augmenting, state the payback period, and build the controls in from day one: budget caps, weekly variance alerts, quarterly re-forecasts.

Download the AI Token Cost Forecast Template (Excel) — three pre-built scenarios with editable price inputs you update each quarter.

Frequently Asked Questions

How do you calculate the monthly cost of an LLM application?

Here is how to calculate LLM cost in practice: multiply monthly active users by active days per month and sessions per active day, then by average tokens per session (input and output separately), then by the current per-token price for each. Adjust for your retry rate and cache hit rate, then multiply by the agent multiplier: 1× for simple RAG, up to 10× for fully agentic systems. The formula and a worked example are above.

Why does this guide avoid a fixed price table?

Because LLM prices change almost monthly and fell roughly 10× in two years for equivalent capability. A fixed table is stale within weeks. The durable parts — the formula, the ~5× output premium, the ~15–30× tier spread, the optimization levers — let you forecast with any current price. Look up today’s numbers in the calculator or on the provider pages.

Why are output tokens more expensive than input tokens?

Output is generated one token at a time, each depending on the last, while input is processed in parallel. Output runs about 5× the price of input, closer to 6× on current flagships, across providers and generations. Cheaper hardware lowers the absolute price but not the ratio.

How much does Agentic RAG cost compared to simple RAG?

Agentic RAG consumes 3–10× the tokens of simple RAG, because each user query triggers several internal model calls to plan, retrieve, use tools, and evaluate. Self-correcting RAG sits near the low end; full reasoning-loop agents near the high end.

Should I use a flagship or an economy-tier model?

Both, for different queries. Route most traffic — around 70% — to economy-tier, some to workhorse, and reserve flagship-tier for the 5–15% of queries that genuinely need deep reasoning. With a 15–30× tier spread, routing is the single biggest cost lever you have.

How much can caching reduce LLM costs?

Prompt-prefix caching discounts repeated input by up to ~90% on the cached portion. Semantic caching serves full stored responses for repeated queries and typically cuts 25–50% on eligible traffic such as FAQ-style support. Creative or highly personalized workloads benefit far less.

Is it cheaper to self-host open-source models?

Only at sustained high volume. Self-hosting turns competitive somewhere in the range of tens to hundreds of millions of tokens a month, and the crossover depends heavily on GPU utilization — an under-loaded GPU can cost more per token than a hosted API. Below that, and for teams without MLOps capacity, hosted APIs usually win once you count engineering time.

What’s a realistic token budget for a small AI pilot?

For a 4–8 week pilot serving 50–200 users on simple RAG, expect a modest volume — on the order of low hundreds of millions of tokens total — and add a 30% buffer for spikes. Convert to dollars with current economy- or workhorse-tier pricing. If the pilot is fully agentic from day one, multiply the volume by 3–5×.

Summary: What to Remember

  • Prices change; the model doesn’t. Forecast with durable ratios and treat the current price as a single input.
  • Six variables drive the bill: active users, session frequency, input and output tokens per session, retry rate, cache hit rate, agent multiplier.
  • Three ratios hold steady: output ~5× input, flagship ~15–30× economy, agentic ~3–10× simple RAG.
  • Tier routing is the biggest lever — most traffic to economy-tier, flagship only where it earns its keep.
  • 40–70% savings inside 90 days come from routing, semantic caching, prompt compression, output constraints, and hybrid search.
  • Re-forecast quarterly and monitor continuously. The price trend runs downward, so disciplined teams keep beating their own budgets.

How We Can Help

At SumatoSoft, we build AI systems where cost is something you design, not something you discover on the invoice. The framework here is Pillar 2 — Financial Governance — of our Agentic Development Lifecycle, the methodology behind every AI engagement we run.

Why SumatoSoft: 14+ years on the market, 350+ delivered custom solutions, 25+ countries. ISO 27001 certified, with a bug-fix guarantee agreed upfront. 70% senior engineers, and AI practice leads who’ve shipped production systems across healthcare, fintech, logistics, and manufacturing.

Start in Three Steps

  1. Book a 30-minute token forecast session. Bring your use case, your current bill if you have one, and your projected user volume.
  2. Get a defensible monthly forecast — three scenarios, built on durable ratios rather than a price snapshot.
  3. Get a 90-day optimization roadmap with the specific techniques that will cut your cost 40–70%, ordered by impact and effort.

No qualification form to fill out first. Bring your numbers; we’ll bring 14 years of building this for other people.

Schedule your token forecast session →
