Tokenization

Also known as: tokenizer, BPE, subword tokenization, byte-pair encoding

TL;DR

Tokenization is how raw text becomes numerical input for a language model — the input is sliced into tokens (sub-word units, typically 3–5 characters each), each token mapped to an integer ID.

Tokenization is the layer between human-readable text and the integer sequences a transformer actually consumes. The tokenizer takes a string in, spits a list of integers out, where each integer corresponds to a sub-word unit in the model’s vocabulary. The reverse direction (integers → text) is decoding.
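To make the encode/decode round trip concrete, here is a minimal sketch. The vocabulary is a hypothetical toy list, and the greedy longest-match lookup is a stand-in for a real tokenizer's algorithm, chosen only to show the string → integers → string contract.

```python
# Toy vocabulary: a hypothetical handful of sub-word units, not a real model's.
VOCAB = ["token", "ization", " is", " fun",
         "t", "o", "k", "e", "n", "i", "z", "a", "s", "f", "u", " "]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding against the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        candidates = [t for t in VOCAB if text.startswith(t, pos)]
        if not candidates:
            raise ValueError(f"no token covers {text[pos]!r}")
        match = max(candidates, key=len)
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def decode(ids: list[int]) -> str:
    """Map each id back to its string piece and concatenate."""
    return "".join(VOCAB[i] for i in ids)


ids = encode("tokenization is fun")
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # tokenization is fun
```

The property worth noticing is that decoding the encoded ids returns the original string exactly, including its spaces.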

What a token actually is

A token is usually 3–5 characters of English text. Common words and word fragments (the, _token, ization, _function, where _ marks a leading space) each occupy a single token. Rare strings (a long identifier, a chemical name) get split across multiple tokens. Whitespace is typically attached to the front of the token that follows it, which is why tokenizing "the cat" produces something like ["the", " cat"] rather than ["the", " ", "cat"].
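The leading-space convention can be sketched with a small pre-tokenization step. The regular expression below is a hypothetical simplification (real tokenizers use more elaborate patterns); it only shows how a space ends up glued to the word after it.

```python
import re

# Hypothetical pre-tokenization pattern: an optional leading space is attached
# to the run of letters/digits or the run of punctuation that follows it.
# Any other whitespace run is kept as its own piece.
PRETOKEN = re.compile(r" ?[A-Za-z0-9]+| ?[^A-Za-z0-9\s]+|\s+")


def pretokenize(text: str) -> list[str]:
    """Split text into pieces without dropping any characters."""
    return PRETOKEN.findall(text)


print(pretokenize("the cat"))  # ['the', ' cat']
```

Because no characters are dropped, joining the pieces reproduces the input, which is what lets decoding restore the original spacing.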

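The popular algorithms are byte-pair encoding (BPE, used by GPT and Claude) and WordPiece (BERT), alongside SentencePiece, a library that implements BPE and a unigram model and is used by T5 and the earlier Llama models. BPE and WordPiece share the core idea: start from characters or bytes, then iteratively merge pairs into single tokens (BPE picks the most frequent pair, WordPiece the pair that most improves corpus likelihood) until you have a fixed-size vocabulary, typically 30k–256k tokens. The unigram model works in the other direction, pruning a large candidate vocabulary down to the target size.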

BPE training works like this. Start with the byte-level alphabet (256 entries) plus the training corpus split into bytes. Count every adjacent byte pair across the corpus; the most frequent pair becomes a new token. Replace every occurrence of that pair in the corpus with the new token id, then repeat: count adjacent pairs (now over a mix of bytes and merged tokens), merge the most frequent, replace, repeat.
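The training loop described above can be sketched in a few lines. This is a minimal, unoptimized illustration; the function names are hypothetical, and ties between equally frequent pairs are broken by whichever pair appears first, which is an assumption rather than a rule every implementation follows.

```python
from collections import Counter


def replace_pair(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(corpus: bytes, num_merges: int) -> list[tuple[int, int]]:
    """Learn up to `num_merges` merges; merge k creates token id 256 + k."""
    ids = list(corpus)  # start from raw bytes, ids 0..255
    merges = []
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        ids = replace_pair(ids, best, 256 + step)
        merges.append(best)
    return merges


print(train_bpe(b"aaabdaaabac", 3))  # [(97, 97), (256, 97), (257, 98)]
```

On this tiny corpus the first merge joins two `a` bytes (97), and later merges build on the newly created ids 256 and 257.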

After K merges your vocabulary is 256 + K entries. The training output is the merge table — an ordered list of pair-merges to apply, in order, when tokenizing new text. Tokenizing “tokenization” with a trained BPE applies merges in that order: maybe t + o → to, then to + k → tok, then tok + en → token, then token + ization → tokenization if the merges were learned that way.

Applying the merges greedily in their learned order is what makes BPE deterministic, and it is also what causes the whitespace-sensitivity quirk: “ token” and “token” follow different merge paths because the leading space alters which pairs match first.
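A small sketch of applying a merge table shows both the ordered merging and the whitespace effect. The merge table here is hypothetical and hand-written for illustration, and it works on characters rather than bytes to stay readable. It includes an e + n merge so that tok + en has something to combine, plus a space + t merge to provoke the leading-space behavior.

```python
def apply_merges(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize by applying each merge rule, in table order, across the text."""
    symbols = list(text)  # start from single characters
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, in the order the merges were "learned".
MERGES = [(" ", "t"), ("t", "o"), ("to", "k"), ("e", "n"), ("tok", "en")]

print(apply_merges("token", MERGES))   # ['token']
print(apply_merges(" token", MERGES))  # [' t', 'o', 'k', 'en']
```

With the leading space, the first rule consumes the `t`, so the later t + o merge never fires and the word ends up in four pieces instead of one.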

Why it’s the load-bearing layer

Almost every constraint and cost in an LLM stack maps back to the tokenizer:

Cost. API pricing is per-token. The tokenizer determines how many tokens your text becomes.
Context window. Max input length is in tokens, not characters or words.
Speed. Generation latency is roughly per-token. More tokens = more time.
Multilingual fairness. Languages under-represented in the tokenizer’s training data need more tokens for equivalent meaning. Korean and Chinese can run 3–6× the token count of English, and cost scales with it.
Special tokens. [CLS], [SEP], <|endoftext|>, the chat-template tokens — all are single tokens in the vocabulary that the model has learned to interpret as control signals.
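Since cost and latency both scale with token counts, a back-of-the-envelope estimate is simple arithmetic. The sketch below uses made-up prices and a made-up per-token generation time; none of these numbers come from a real provider.

```python
def estimate_request(prompt_tokens: int, output_tokens: int,
                     usd_per_1k_input: float, usd_per_1k_output: float,
                     seconds_per_output_token: float) -> dict:
    """Rough cost and generation-time estimate from token counts alone."""
    cost = (prompt_tokens / 1000) * usd_per_1k_input \
        + (output_tokens / 1000) * usd_per_1k_output
    # Generation time is modeled as linear in output tokens (an approximation).
    seconds = output_tokens * seconds_per_output_token
    return {"cost_usd": cost, "generation_seconds": seconds}


# Hypothetical prices and speed, for illustration only.
estimate = estimate_request(
    prompt_tokens=2000, output_tokens=500,
    usd_per_1k_input=0.5, usd_per_1k_output=1.5,
    seconds_per_output_token=0.02,
)
print(estimate["cost_usd"])  # 1.75
```

If the same content needs three times as many tokens in another language, both figures grow roughly threefold under this model.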

Quirks that bite in production

Whitespace sensitivity. "reranker" and " reranker" are usually different tokens, with different embeddings. Normalize leading and trailing whitespace consistently before tokenizing.
Number splitting. Many tokenizers split numbers digit by digit or into short, irregular chunks. The model has to learn arithmetic over those pieces, which is part of why LLMs are bad at it.
Identifier fragmentation. A function name like getUserAccountSettings may become 4–6 tokens of mostly noise. Code-aware tokenizers (StarCoder, DeepSeek-Coder) handle these better.
Tokenizer mismatch. Documents and queries must go through the same embedding model, and therefore the same tokenizer. If you build your retrieval index with one model (e.g., OpenAI text-embedding-3-large) and embed queries with a different one, the vectors live in different spaces and won’t align.
Glitch tokens. Tokens that made it into the vocabulary but that the model barely saw during training (the infamous “SolidGoldMagikarp”) trigger pathological outputs when invoked. Modern providers prune these from chat surfaces.
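One way to guard against the mismatch quirk is to record which embedding model built an index and refuse anything produced by a different one. The class below is a hypothetical sketch, not any vector database's API, and the model names are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Hypothetical index that remembers which embedding model produced it."""
    embedding_model: str
    vectors: dict = field(default_factory=dict)

    def _require_same_model(self, embedding_model: str) -> None:
        if embedding_model != self.embedding_model:
            raise ValueError(
                f"index built with {self.embedding_model!r}, "
                f"got vectors from {embedding_model!r}"
            )

    def add(self, doc_id: str, vector: list, embedding_model: str) -> None:
        self._require_same_model(embedding_model)
        self.vectors[doc_id] = vector

    def check_query(self, embedding_model: str) -> None:
        """Call before searching; raises if the query model differs."""
        self._require_same_model(embedding_model)


index = VectorIndex(embedding_model="model-a")   # placeholder model name
index.add("doc-1", [0.1, 0.2], embedding_model="model-a")
index.check_query("model-a")                     # passes silently
```

Failing loudly at query time is cheaper than debugging silently poor retrieval later.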

In practice

You almost never write a tokenizer; you use one bundled with your model. The thing to know: count tokens, not characters, when budgeting context or cost. tiktoken (OpenAI), tokenizers (HuggingFace), and sentencepiece (Google) all have utilities to do this offline before you spend the API call.
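A budget check only needs something that turns text into a token sequence, so the sketch below takes the encoder as a parameter. The whitespace encoder is a stand-in that undercounts real tokens; with tiktoken installed, `tiktoken.get_encoding("cl100k_base").encode` could be passed instead (which encoding matches your model is something to confirm, not assume).

```python
from typing import Callable, Sequence


def fits_context(text: str,
                 encode: Callable[[str], Sequence],
                 context_window: int,
                 reserved_for_output: int = 0) -> bool:
    """True if the tokenized text plus the reserved output budget fits."""
    return len(encode(text)) + reserved_for_output <= context_window


def whitespace_encode(text: str) -> list[str]:
    """Stand-in encoder for illustration; real tokenizers yield more tokens."""
    return text.split()


prompt = "count tokens, not characters, when budgeting context"
print(fits_context(prompt, whitespace_encode, context_window=16))
# True: 7 pieces fit in 16
print(fits_context(prompt, whitespace_encode, context_window=16,
                   reserved_for_output=12))
# False: 7 + 12 exceeds 16
```

Reserving room for the output matters because prompt and completion usually share the same window.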

Go further

Why don't LLMs just use characters or words?

Characters: sequences get too long. With one token per character, a 4,000-token context holds only 4,000 characters instead of roughly 16,000, which is trivially short for any real document. Words: too many. English has hundreds of thousands of distinct words; the embedding table would be enormous, and rare or unseen words such as proper nouns would fall outside the vocabulary entirely. Sub-word tokenization (BPE, WordPiece, SentencePiece) finds a sweet spot: common sequences become single tokens, rare ones get split into smaller pieces.

Why do tokens cost different amounts in different languages?

Most popular tokenizers (GPT's, Claude's) were optimized on English-heavy training data. English text averages ~1.3 tokens per word. The same translated content in Korean, Hindi, or Chinese can be 3–6× more tokens, because fewer of the learned merges cover those scripts, so their text compresses less efficiently. Multilingual production stacks budget for this asymmetry.
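One contributing factor is visible before any merges are learned: a byte-level tokenizer starts from UTF-8 bytes, and non-Latin scripts take more bytes per character. The snippet below only measures that starting point. It does not reproduce any tokenizer's actual counts, and the sample strings are arbitrary greetings.

```python
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


print(bytes_per_char("hello"))       # 1.0
print(bytes_per_char("안녕하세요"))   # 3.0
print(bytes_per_char("你好"))         # 3.0
```

A script that starts at three bytes per character needs more merges to reach the same compression, and an English-heavy training corpus learns fewer of them.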

Does the tokenizer affect retrieval too?

Yes, in subtle ways. BM25 tokenizers are typically simpler (whitespace + lowercase + stem). Embedding models inherit their tokenizer from the underlying transformer, which means rare proper nouns, identifiers, or chemical names may get fragmented across multiple tokens, degrading retrieval signal on those terms unless your model was specifically trained for the domain.
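For contrast, a lexical tokenizer of the whitespace + lowercase + stem kind can be sketched in a few lines. The suffix-stripping stemmer here is a deliberately crude, hypothetical stand-in; production systems typically use an established stemmer instead.

```python
import re

# Crude, hypothetical stemmer: strip one common suffix if enough stem remains.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lexical_tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumeric characters, then stem."""
    return [stem(word) for word in re.findall(r"[a-z0-9]+", text.lower())]


print(lexical_tokenize("Ranking ranked documents"))
# ['rank', 'rank', 'document']
print(lexical_tokenize("getUserAccountSettings"))
# ['getuseraccountsetting']
```

The identifier survives as a single term here, so an exact lexical match is still possible, whereas a sub-word tokenizer may spread it across several pieces.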
