Three layers. First, step traces: every iteration logs the inputs (full context), outputs (tool requests and reasoning), tool results, latency, and token spend. Without this you can’t debug a single failed run.
Second, trajectory replay: re-run a saved trajectory against a different prompt, model, or toolset. Replaying yesterday’s 1000 production trajectories against your candidate change tells you whether it would have helped, hurt, or done nothing. The replay infrastructure is the single highest-leverage thing to build once you have an agent in production.
Third, trajectory clustering: trajectories binned by intent, failure mode, and tool path surface the long tail — queries that take 15 steps, ones that exhaust the budget, ones that succeed but cost 50× the median. Without aggregation you only see individual failures, not the shape of the problem.