Pretraining

Also known as: pre-training, base model training, foundation training

TL;DR

Pretraining is the initial massive next-token-prediction phase that trains a language model on trillions of tokens of generic text. It's where an LLM acquires its broad capability — grammar, world knowledge, reasoning, code.

Pretraining is the first and most expensive phase of training a large language model. The recipe is conceptually simple: take a randomly initialized transformer, feed it trillions of tokens of text, and train it to predict each next token given everything that came before. Repeat for months on a cluster of thousands of GPUs. What comes out is a base model: a network with broad linguistic and world capability, but not yet shaped for any particular task.

The objective

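Pretraining is supervised learning in disguise (often called self-supervised learning). The labels come for free from the text itself: at every position, the “right answer” is the token that actually appears next. Loss is the cross-entropy between the model’s predicted distribution and the one-hot true token, which reduces to the average negative log-probability assigned to the true next token:

$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Here $x_t$ is the token at position $t$, $x_{<t}$ is everything before it, $T$ is the sequence length, and $\theta$ are the model parameters.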

Minimize this over a sufficiently diverse corpus and the model has no choice but to internalize an enormous amount about how language, reasoning, and the world work — because predicting the next token in arbitrary text requires that internalization.
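As a minimal sketch of that objective, the snippet below computes the mean next-token cross-entropy for a toy sequence in plain Python. The function names, the three-token vocabulary, and the hand-written logits are illustrative assumptions; a real training loop would take its logits from a transformer and use a framework's built-in loss.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy for next-token prediction.

    logits_per_position[t] holds the (hypothetical) model's scores over the
    vocabulary after reading token_ids[:t + 1]; the label is token_ids[t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        # Cross-entropy against a one-hot target is -log p(target).
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy setup: vocabulary of 3 tokens, sequence [0, 2, 1].
token_ids = [0, 2, 1]

# A model that knows nothing spreads probability evenly: loss = ln(3).
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss_uniform = next_token_loss(uniform_logits, token_ids)

# A model that favors the true next token gets a lower loss.
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]
loss_confident = next_token_loss(confident_logits, token_ids)
```

The labels never appear as a separate input: they are just the same token sequence shifted by one position, which is why no human annotation is needed.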

What the model picks up

The empirical finding that makes LLMs work: at scale, next-token prediction generalizes far beyond syntax. A pretrained base model exhibits:

Capabilities that emerge from next-token prediction

World knowledge. Names, dates, scientific facts, code APIs — anything reasonably represented in the pretraining corpus.
Linguistic competence. Grammar, idiom, register, multi-language fluency proportional to data mix.
Latent skills. In-context learning, few-shot generalization, basic arithmetic, simple reasoning — none explicitly trained for; all emergent.
Programming. Code is a sizable fraction of modern pretraining mixes; base models can complete and reason about code without any code-specific fine-tuning.

These capabilities are the substrate that everything downstream (instruction tuning, RLHF, domain fine-tuning) builds on.

Scale

Modern pretraining corpora run 1T–15T tokens. Models range from ~1B parameters (small open-weight) to multi-hundred-billion (frontier). The Chinchilla scaling laws (Hoffmann et al., 2022) suggested an optimal ratio of ~20 training tokens per parameter; current frontier runs go well past that ratio because inference-time efficiency favors smaller, longer-trained models.
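To make the tokens-per-parameter ratio concrete, here is a small sketch that applies the ~20:1 rule of thumb and compares it with a longer run. The helper names are hypothetical, and the 70B-parameter, 15T-token configuration is used purely as an illustration.

```python
TOKENS_PER_PARAM = 20  # rough Chinchilla rule of thumb, not an exact constant

def chinchilla_tokens(n_params, ratio=TOKENS_PER_PARAM):
    """Approximate compute-optimal token count for a given parameter count."""
    return ratio * n_params

def tokens_per_param(n_params, n_tokens):
    """Actual training tokens seen per parameter."""
    return n_tokens / n_params

# A 70B-parameter model would be "Chinchilla-optimal" at about 1.4T tokens.
optimal_70b = chinchilla_tokens(70e9)

# Training the same model on 15T tokens is roughly 214 tokens per parameter,
# about ten times past the rule of thumb.
actual_ratio = tokens_per_param(70e9, 15e12)
overtraining_factor = actual_ratio / TOKENS_PER_PARAM
```

The rule of thumb minimizes training compute for a target loss; it says nothing about serving cost, which is why runs meant for heavy inference use tend to overshoot it.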

The compute budget is dominated by the size × tokens product. A 70B-parameter model trained on 15T tokens takes roughly $6.3 \times 10^{24}$ floating-point operations, which means months on tens of thousands of accelerators.

For a transformer with $N$ parameters trained on $D$ tokens, the forward pass costs about $2N$ FLOPs per token (each parameter participates in one multiply-add, which is two FLOPs). The backward pass is roughly twice the forward pass, because it computes gradients with respect to both weights and activations, for another $4N$ FLOPs per token. Total: $6N$ FLOPs per token, or $C \approx 6ND$ for the whole run. The Kaplan and Chinchilla scaling-law papers both work in these units. Practical implication: a 1B model on 1T tokens is $6 \times 10^{21}$ FLOPs, which fits in a few weeks on a small cluster; a 70B model on 15T tokens is about a thousand-fold larger, requires months on tens of thousands of accelerators, and is why frontier pretraining costs tens of millions of dollars in compute alone, before you count the engineers, the data pipeline, or the failed runs.
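The arithmetic above can be checked with a short sketch that uses the common approximation of about 6 FLOPs per parameter per token (2 for the forward pass, 4 for the backward pass). The wall-clock helper is an assumption added for illustration: the accelerator count, the peak throughput of 4e14 FLOP/s, and the 40% utilization are hypothetical values, not measurements of any particular hardware.

```python
SECONDS_PER_DAY = 86_400

def training_flops(n_params, n_tokens):
    """Approximate training compute: (2N forward + 4N backward) per token."""
    return 6 * n_params * n_tokens

def training_days(total_flops, n_accelerators, peak_flops_per_sec, utilization):
    """Rough wall-clock estimate from sustained cluster throughput."""
    sustained = n_accelerators * peak_flops_per_sec * utilization
    return total_flops / sustained / SECONDS_PER_DAY

small = training_flops(1e9, 1e12)    # 1B params, 1T tokens: about 6e21 FLOPs
large = training_flops(70e9, 15e12)  # 70B params, 15T tokens: about 6.3e24 FLOPs
ratio = large / small                # about 1050, i.e. roughly a thousand-fold

# Hypothetical small cluster: 16 accelerators, 4e14 peak FLOP/s each,
# 40% of peak actually sustained. All three numbers are assumptions.
days_small = training_days(small, 16, 4e14, 0.4)
```

Under those assumed hardware numbers the 1B run lands at a few weeks. The estimate is very sensitive to utilization, which in practice depends on the parallelism strategy, interconnect, and how often the run has to restart.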

Pretraining vs everything that follows

A base model knows a lot but isn’t useful: ask it a question and it’ll continue the question rather than answer it, because pretraining never showed it that “questions get answered”. Turning a base model into a chatbot, a reranker, or an embedding model requires post-training: instruction tuning, preference optimization, distillation, or task-specific fine-tuning. That second phase is where the model becomes a product, but it would all collapse without the broad capability pretraining laid down.

Perplexity on held-out text is the standard pretraining eval — lower is better, but downstream task performance is what actually matters once the base model is shipped.
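Perplexity is the exponential of the average per-token negative log-likelihood, so it follows directly from the training loss. The sketch below assumes you already have natural-log probabilities that a model assigned to each held-out token; the function name and the toy numbers are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every held-out token has
# perplexity 4: it is as uncertain as a uniform choice among 4 tokens.
ppl_uniform = perplexity([math.log(1 / 4)] * 10)

# Assigning higher probability to the true tokens lowers perplexity.
ppl_better = perplexity([math.log(1 / 2)] * 10)
```

Because perplexity depends on the tokenizer, values are only directly comparable between models that share a vocabulary and are scored on the same held-out text.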

Pretraining buys broad capability; everything else just shapes it. The base model is the foundation — and like any foundation, it limits what can be built above it.

Go further

Why is pretraining so much more expensive than fine-tuning?

Pretraining updates every parameter over trillions of tokens, and a single frontier pretraining run consumes tens of millions of dollars in compute. Fine-tuning runs on millions of tokens, often updating only ~1% of parameters via LoRA. Pretraining therefore typically costs 1000–10000× more than fine-tuning.

Related: Fine-tuning, Knowledge distillation

What does the model actually learn from next-token prediction?

More than the objective suggests. To predict the next token in a sentence about quantum mechanics or a Python function or a legal contract, the model has to encode something like the underlying domain. Pretraining at scale converts a syntactic objective into broad implicit knowledge — that's the surprising emergent property that makes LLMs work.

Related: Large language model, Transformer

Where does pretraining data come from?

Public web crawls (Common Crawl, FineWeb), curated books and papers, code repositories (GitHub), Wikipedia, and increasingly synthetic data generated by stronger models. Data quality and deduplication matter as much as quantity — a smaller, cleaner corpus often outperforms a larger noisy one.

Related: Tokenization, Perplexity
