Tokenization

Also known as: tokenizer, BPE, subword tokenization, byte-pair encoding

TL;DR

Tokenization is how raw text becomes numerical input for a language model — the input is sliced into tokens (sub-word units, typically 3–5 characters each), each token mapped to an integer ID.

Tokenization is the layer between human-readable text and the integer sequences a model actually consumes. The tokenizer takes a string in, spits a list of integers out, where each integer corresponds to a sub-word unit in the model’s vocabulary. The reverse direction (integers → text) is decoding.

[Figure: Tokenization, text → IDs. Three ways to slice the 17-character string “indistinguishable”; two are wrong. Character-level: 17 tokens (one token per letter; zero ambiguity, brutal length). Word-level: 1 token, <unk> here (cheap when in-vocab, useless when not). Subword BPE: 3 tokens (common pieces stay whole; rare strings get split). One string in, integer IDs out, but the choice of slicing decides everything downstream. Subword BPE is the industry default.]

What a token actually is

A token is usually 3–5 characters of English text. Common words and word fragments (the, _token, ization, _function) each occupy a single token. Rare strings (a long identifier, a chemical name) get split across multiple tokens. Whitespace is typically prefixed onto the token that follows it, which is why tokenizing "the cat" produces something like ["the", " cat"] rather than ["the", " ", "cat"].
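
The whitespace-prefix behavior can be sketched with a small pre-tokenization step. This is an illustration only: `PRETOKEN_PATTERN` and `pretokenize` are hypothetical names, and the regular expression is a simplified stand-in, not the pattern any particular production tokenizer uses.

```python
import re

# Hypothetical, simplified pattern: a single leading space stays attached
# to the word or punctuation run that follows it; other whitespace runs
# are kept as their own chunks.
PRETOKEN_PATTERN = re.compile(r" ?[A-Za-z0-9_]+| ?[^\sA-Za-z0-9_]+|\s+")


def pretokenize(text: str) -> list[str]:
    """Split text into chunks without losing any characters."""
    return PRETOKEN_PATTERN.findall(text)


print(pretokenize("the cat"))       # ['the', ' cat']
print(pretokenize("the cat sat."))  # ['the', ' cat', ' sat', '.']
```

Because every character lands in exactly one chunk, joining the chunks gives back the original string, which is what makes decoding lossless.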

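The popular algorithms are byte-pair encoding (BPE, used by GPT and Claude), WordPiece (BERT), and the BPE and unigram models implemented in SentencePiece (T5, Llama). They differ in details (WordPiece scores candidate merges by likelihood rather than raw frequency, and the unigram model prunes a large candidate vocabulary instead of merging) but share the goal of a fixed-size sub-word vocabulary, typically 30k–256k tokens. BPE, the most common, starts from characters or bytes and iteratively merges the most frequent adjacent pair into a single token.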

Start with the byte-level alphabet (256 entries) plus the training corpus split into bytes. Count every adjacent byte pair across the corpus; the most frequent pair becomes a new token. Replace every occurrence of that pair in the corpus with the new token id, then repeat: count adjacent pairs (now over a mix of bytes and merged tokens), merge the most frequent, replace, repeat.

After K merges your vocabulary is 256 + K entries. The training output is the merge table — an ordered list of pair-merges to apply, in order, when tokenizing new text. Tokenizing “tokenization” with a trained BPE applies merges in that order: maybe t + o → to, then to + k → tok, then tok + en → token, then token + ization → tokenization if the merges were learned that way.
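
The training loop described above can be sketched in a few lines. This is a minimal, unoptimized illustration: `train_bpe` and `replace_pair` are hypothetical names, ties between equally frequent pairs are broken by first occurrence (an assumption, since the text does not specify a rule), and real trainers usually pre-split text and use faster data structures.

```python
from collections import Counter


def replace_pair(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(corpus: str, num_merges: int) -> list[tuple[int, int]]:
    """Learn up to `num_merges` merges; merge number k creates token id 256 + k."""
    ids = list(corpus.encode("utf-8"))  # start from raw bytes (ids 0..255)
    merges: list[tuple[int, int]] = []
    for step in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))  # count adjacent pairs
        if not pair_counts:
            break  # fewer than two tokens left, nothing to merge
        best_pair, _ = pair_counts.most_common(1)[0]
        ids = replace_pair(ids, best_pair, 256 + step)
        merges.append(best_pair)
    return merges


merges = train_bpe("aaabdaaabac", num_merges=3)
print(merges[0])          # (97, 97): the byte pair "aa" is most frequent
print(256 + len(merges))  # 259 vocabulary entries after 3 merges
```

The returned list is the merge table: its order is the order in which merges are later applied to new text.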

This greedy, fixed-order application of merges is what makes BPE deterministic, and it is also what causes the whitespace-sensitivity quirk: “ token” and “token” hit different merge paths because the leading space changes which pairs are available to merge.
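
Applying a merge table to new text can be sketched as follows. The merge table here is hand-written for illustration (the pairs are assumptions, not merges from any real vocabulary), and `build_merges`, `bpe_encode`, and `bpe_decode` are hypothetical names. The encoder repeatedly applies whichever available merge was learned earliest.

```python
def replace_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def build_merges(string_pairs):
    """Turn readable (left, right) string pairs into an ordered id-pair table."""
    vocab = {bytes([i]): i for i in range(256)}
    merges = []
    for left, right in string_pairs:
        l, r = left.encode("utf-8"), right.encode("utf-8")
        merges.append((vocab[l], vocab[r]))
        vocab[l + r] = 256 + len(merges) - 1
    return merges


def bpe_encode(text, merges):
    """Encode text by applying the earliest-learned available merge each round."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        candidates = [(ranks[p], p) for p in zip(ids, ids[1:]) if p in ranks]
        if not candidates:
            break
        rank, pair = min(candidates)  # lowest rank = learned first
        ids = replace_pair(ids, pair, 256 + rank)
    return ids


def bpe_decode(ids, merges):
    """Expand merged ids back into bytes, then into text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        vocab[256 + rank] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


# Hypothetical merge table, in the order the merges were "learned".
MERGES = build_merges([("t", "o"), ("to", "k"), ("e", "n"), ("tok", "en"), (" ", "token")])

print(bpe_encode("token", MERGES))   # [259]
print(bpe_encode(" token", MERGES))  # [260]
print(bpe_decode(bpe_encode("a token", MERGES), MERGES))  # a token
```

With this table, "token" and " token" end up as different single ids, a small instance of the leading-space effect.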

[Figure: BPE merge progression for the input “tokenization”. Common words collapse; rare strings stay split. Step 0 (256-entry vocab): 12 tokens. ~1k merges: 4 tokens. ~10k merges: 2 tokens. ~50k merges: 1 token. Each step merges the most frequent adjacent pair into a new vocab entry, so frequent words decay to one token while rare strings stay multi-token.]

Why it’s the load-bearing layer

Almost every constraint and cost in an LLM stack maps back to the tokenizer:

  • Cost. API pricing is per-token. The tokenizer determines how many tokens your text becomes.
  • Context window. Max input length is in tokens, not characters or words.
  • Speed. Generation latency is roughly per-token. More tokens = more time.
  • Multilingual fairness. Languages under-represented in the tokenizer’s training data cost more tokens for equivalent meaning. Korean and Chinese routinely run 3–6× more expensive than English.
  • Special tokens. [CLS], [SEP], <|endoftext|>, the chat-template tokens — all are single tokens in the vocabulary that the model has learned to interpret as control signals.

Quirks that bite in production

  • Whitespace sensitivity. "reranker" and " reranker" are usually different tokens, with different embeddings. Be consistent about leading spaces when you assemble prompts or compare token IDs.
  • Number splitting. Many tokenizers split numbers digit-by-digit or into irregular multi-digit chunks. The model has to learn arithmetic over those pieces, which is part of why LLMs are bad at it.
  • Identifier fragmentation. A function name like getUserAccountSettings may become 4–6 tokens of mostly noise. Code-aware tokenizers (StarCoder, DeepSeek-Coder) treat these better.
  • Tokenizer mismatch. If you build your retrieval index with one embedding model and its tokenizer, then query with another (e.g., comparing OpenAI text-embedding-3-large vectors against vectors from a different model), the embeddings won’t align.
  • Glitch tokens. Rare merges that the model barely saw during training (the infamous “SolidGoldMagikarp”) trigger pathological outputs when invoked. Modern providers prune these from chat surfaces.
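
The number-splitting quirk can be illustrated with a toy pre-tokenizer. This is a sketch under an assumption: `split_digits` is a hypothetical helper that isolates every digit, which mimics the digit-by-digit behavior described above but is not the rule of any specific tokenizer.

```python
import re


def split_digits(text: str) -> list[str]:
    """Split text so that every digit becomes its own piece."""
    # One digit | a run of non-digit, non-space characters | a run of whitespace
    return re.findall(r"\d|[^\d\s]+|\s+", text)


pieces = split_digits("total = 12345")
print(pieces)       # ['total', ' ', '=', ' ', '1', '2', '3', '4', '5']
print(len(pieces))  # 9
```

The five-digit number alone accounts for five of the nine pieces, so the model never sees "12345" as a single unit.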

In practice

You almost never write a tokenizer; you use the one bundled with your model. The thing to know: count tokens, not characters, when budgeting context or cost. tiktoken (OpenAI), tokenizers (Hugging Face), and sentencepiece (Google) all have utilities to do this offline before you spend the API call.
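
A budgeting check might look like the sketch below. Everything here is an assumption for illustration: `budget` is a hypothetical helper, the context limit and price are made-up numbers, and `rough_encode` is a crude stand-in (about 4 characters per token) that should be swapped for the real `encode` function of the tokenizer that ships with your model.

```python
from typing import Callable, Sequence


def budget(
    text: str,
    encode: Callable[[str], Sequence],
    context_limit: int,
    price_per_1k_tokens: float,
) -> dict:
    """Count tokens with the supplied encoder and estimate fit and cost."""
    n_tokens = len(encode(text))
    return {
        "tokens": n_tokens,
        "fits": n_tokens <= context_limit,
        "cost": n_tokens / 1000 * price_per_1k_tokens,
    }


def rough_encode(text: str) -> list[str]:
    """Stand-in encoder: fixed 4-character chunks. NOT a real tokenizer."""
    return [text[i:i + 4] for i in range(0, len(text), 4)]


# Hypothetical limit and price, chosen only to make the arithmetic visible.
report = budget("a" * 8000, rough_encode, context_limit=4000, price_per_1k_tokens=0.5)
print(report)  # {'tokens': 2000, 'fits': True, 'cost': 1.0}
```

Passing the encoder in as a parameter keeps the budgeting logic independent of which tokenizer library you use.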

Go further

Why don't LLMs just use characters or words?

Characters: too long. With one token per character, a 4,000-token context holds only 4,000 characters, versus roughly 16,000 with sub-word tokens, which is trivially short for any real document. Words: too many. English has hundreds of thousands of distinct words; the embedding table would be enormous, and out-of-vocabulary words such as rare proper nouns can't be handled. Sub-word tokenization (BPE, WordPiece, SentencePiece) finds a sweet spot: common sequences become single tokens, rare ones get split into smaller pieces.
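
The two failure modes can be seen with toy tokenizers. This is a sketch: `char_tokenize`, `word_tokenize`, and the three-word `WORD_VOCAB` are hypothetical, chosen only to show sequence length on one side and out-of-vocabulary loss on the other.

```python
def char_tokenize(text: str) -> list[str]:
    """Character-level: one token per character, never out of vocabulary."""
    return list(text)


def word_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Word-level: whole words only; anything unknown collapses to <unk>."""
    return [word if word in vocab else "<unk>" for word in text.split()]


WORD_VOCAB = {"the", "cat", "sat"}  # hypothetical tiny vocabulary

print(len(char_tokenize("indistinguishable")))                   # 17
print(word_tokenize("the cat sat indistinguishably", WORD_VOCAB))
# ['the', 'cat', 'sat', '<unk>']
```

The character tokenizer spends 17 tokens on one word, while the word tokenizer discards the rare word entirely.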

Why do tokens cost different amounts in different languages?

Most popular tokenizers (GPT's, Claude's) were optimized on English-heavy training data. English text averages ~1.3 tokens per word. The same translated content in Korean, Hindi, or Chinese can be 3–6× more tokens, because their characters don't compress as efficiently. Multilingual production stacks budget for this asymmetry.
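
One contributing factor for byte-level tokenizers can be checked directly: before any merges are applied, text is a sequence of UTF-8 bytes, and Korean and Chinese characters take more bytes each than ASCII letters. The sketch below only measures that starting point; `utf8_stats` is a hypothetical helper, the sample strings are arbitrary, and actual token counts depend on the merges a given tokenizer has learned.

```python
def utf8_stats(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


samples = {"English": "hello", "Korean": "안녕하세요", "Chinese": "你好"}

for language, text in samples.items():
    n_chars, n_bytes = utf8_stats(text)
    print(f"{language}: {n_chars} chars, {n_bytes} bytes")
# English: 5 chars, 5 bytes
# Korean: 5 chars, 15 bytes
# Chinese: 2 chars, 6 bytes
```

If the training corpus contains few merges for those byte sequences, more of the bytes survive as separate tokens.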

Does the tokenizer affect retrieval too?

Yes, in subtle ways. [BM25](/concepts/bm25/) tokenizers are typically simpler (whitespace + lowercase + stem). Embedding models inherit their tokenizer from the underlying transformer, which means rare proper nouns, identifiers, or chemical names may get fragmented across multiple tokens — degrading retrieval signal on those terms unless your model was specifically trained for the domain.
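
For contrast, a lexical tokenizer of the whitespace + lowercase + stem kind can be sketched as below. The names `lexical_tokenize` and `crude_stem` are hypothetical, and the suffix stripping is a deliberately crude stand-in for a real stemmer such as Porter's, not an implementation of one.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order
PUNCTUATION = ".,;:!?()[]\"'"


def crude_stem(word: str) -> str:
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lexical_tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, trim punctuation, then stem."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(PUNCTUATION)
        if word:
            tokens.append(crude_stem(word))
    return tokens


print(lexical_tokenize("Ranking models ranked documents."))
# ['rank', 'model', 'rank', 'document']
```

Here a rare identifier stays one whole term, whereas a sub-word tokenizer may split it into several pieces.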
