Cost per Token

Q: When does running your own model beat a hosted API?

When sustained QPS justifies the GPU. Rule of thumb: at FORMULA0.0002 per 1K output tokens versus $5-15/M from frontier APIs — break-even is at modest sustained load (5-10 req/s). Below that, paying API rates is cheaper. Above, self-hosting wins on cost but adds operational burden.

Also known as: price per token, token pricing, LLM economics

TL;DR

The economics primitive of LLM-driven systems. Per-token pricing — input and output, with output usually 3-5× input — is what makes a feature financially viable or not. Production decisions are dominated by this number more than any other.

Cost per token is the economics primitive for any system built on LLM calls. Almost every production decision — model choice, prompt structure, caching strategy, retrieval depth, agent loop length — comes back to per-token cost. If you’re not modeling it, you’re flying blind.

Almost every production decision in LLM systems is, at root, a question about per-token cost. The teams that succeed model it explicitly; the teams that fail discover it on the invoice.

The pricing shape (mid-2026)

Per million tokens, frontier API pricing is roughly:

Model	Input	Output
GPT-4o, Claude Sonnet 4	$3-5	$10-15
GPT-5, Claude Opus 5	$15-30	$50-75
Cheap tier (Haiku, Mini)	$0.25-1	$1-5
Open-weight (DeepSeek, Llama via Together/Fireworks)	$0.20-0.50	$0.40-1

Two patterns:

Output is 3-5× input. Reflects the prefill (parallel, fast) vs decode (sequential, slow) cost asymmetry.
Frontier ↔ cheap tier is ~10×. Same infra cost shape; capability tier dominates pricing.

What dominates production cost

Counterintuitively, input tokens usually dominate in production retrieval pipelines:

A RAG query has 5K-30K input tokens (system prompt + retrieved docs + user question) and 200-1000 output tokens.
Even at 3× output multiplier, input tokens.
Verdict: input cost dominates by 10×.

This flips for chat without retrieval (more output, less input). It flips again for code generation (long output, short input). Always profile your actual traffic shape.

What moves the needle

Levers ranked by typical cost reduction

Caching . Prompt-prefix caching (Anthropic, OpenAI) charges cached input at 10-25% of normal rate. For agent loops with long stable prompts, this is a 4-10× cost reduction.
Context compression . Send 5K relevant tokens instead of 50K retrieved tokens. 10× input cost reduction. Quality goes up (less noise) when compression is well-tuned.
Model tier routing. Cheap tier for narrow tasks, frontier for hard reasoning. Use a router (cheap classifier) to pick. 5-10× cost reduction on traffic mix.
Specialized small models. Reranker calls cost ~1 for a frontier-LLM relevance judgment. 100× cost reduction on the retrieval-scoring portion.
Quantization, continuous batching , speculative decoding . When self-hosting: each is a 1.5-4× cost win.

The specialized-model arithmetic

A 4B cross-encoder reranking 100 candidates costs roughly 0.10-1.00 — a 100× gap. Applied across millions of retrievals a day, that arithmetic is the case for specialized small models over generalist LLMs at every layer of the stack where a narrow task admits a narrow model.

The mental model

Build a per-call cost model in a spreadsheet. Inputs: tokens in, tokens out, model tier, cache hit rate, retrieval depth. Output: 50K/mo for the LLM bill”, you have headroom. If it’s “$5M/mo”, every cost-saving lever above is worth implementing — probably more than once.

Per-call cost shape depends entirely on the input-to-output token ratio. A pure-retrieval workload with a long compressed context and a short answer is dominated by input cost, even at 3-5× output multipliers. A creative-writing workload with a tight prompt and pages of generation is dominated by output cost.

This flips which optimizations matter. For input-heavy workloads, prompt caching and context compression buy 5-10× wins; output-side optimizations like speculative decoding give 1.5-2× wins. For output-heavy workloads, the priorities invert — caching helps less, while speculative decoding and quantization-aware serving become the headline levers.

The single most useful diagnostic in a cost model is the per-call input/output ratio plotted across all live traffic. A bimodal distribution (some calls input-heavy, some output-heavy) implies you should consider a tiered architecture — different model tiers and serving stacks for the two modes.

Prefix caching saves on the cached portion of the input — typically 10-25% of normal input price. The win depends on three conditions: the prefix is long enough that prefill cost dominates the call, the prefix is stable across many requests, and the cache hit rate is high.

Common ways the win evaporates: a system prompt that includes a timestamp or session ID at the top (cache invalidates every call); a prompt where retrieved context is concatenated before the system prompt (the system prompt is “shadowed” by varying retrieval and never caches); routing across replicas that don’t share cache (low hit rate even with stable prompts). The fix is structural — put dynamic content after stable content in the prompt, and use sticky routing or a centralized cache.

Anthropic and OpenAI both expose prefix caching with a TTL of 5-10 minutes by default. Workloads with bursty traffic to the same prompt do well; workloads where requests are spaced out beyond TTL get no benefit.

Go further

Why is output usually 3-5× the price of input?

Input is processed in parallel (prefill) — fast, compute-bound. Output is processed sequentially (decode) — memory-bound, much lower throughput per GPU. Vendors price to reflect the compute cost: an output token consumes roughly 3-5× the GPU-time of an input token at standard batch sizes.

Throughput Continuous batching

What does prompt caching actually save?

Most providers charge cached input tokens at 10-25% of normal input price. For a long system prompt repeated across requests, that's a 4-10× cost reduction on the prefill portion. The win is biggest for agent loops with stable system prompts and varying user messages — exactly the workload that dominates production usage.

Caching strategies KV cache

When does running your own model beat a hosted API?

When sustained QPS justifies the GPU. Rule of thumb: at 0.0002 per 1K output tokens versus $5-15/M from frontier APIs — break-even is at modest sustained load (5-10 req/s). Below that, paying API rates is cheaper. Above, self-hosting wins on cost but adds operational burden.

LLM observability Cost per token

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs