Multi-Agent Systems

Also known as: multi-agent, agent collaboration, agent orchestration, agent debate

TL;DR

Multi-agent systems use multiple specialized agents — different roles, tools, or models — coordinating to solve a task. Patterns range from a coordinator dispatching to specialists to debate setups where agents argue toward a better answer.

A multi-agent system runs multiple LLM-backed agents — typically with distinct roles, tools, prompts, or model choices — coordinating on one task. The motivation is the same as in human teams: specialization beats generalism on complex work, and decomposition makes each piece tractable.

Common patterns

Coordinator + specialists. A top-level agent plans and dispatches sub-tasks to specialist agents (a coding specialist, a research specialist, a data-analysis specialist). The coordinator integrates results. Cleanest, most-shipped pattern in production.
Sequential pipeline. Agents arranged in a chain — one drafts, the next reviews, the next polishes. Conceptually similar to a feature pipeline; popular in content workflows.
Debate / adversarial. Two or more agents argue different sides of a question; a judge agent (or majority vote) decides. Useful for hard reasoning problems where errors are uncorrelated across agents.
Marketplace / blackboard. Agents post partial solutions to a shared workspace; other agents read and contribute. Powerful in theory; messy in practice for debugging and bounding cost.
Hierarchical decomposition. Coordinators of coordinators: the top-level planner decomposes into sub-coordinators, who decompose further. Scales to large goals; risks explosive cost without aggressive budgets.
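
The coordinator + specialists pattern above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the agent functions, the SPECIALISTS registry, and the plan() step are hypothetical stand-ins for real LLM calls, and the "integration" step is reduced to joining the results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical agent type: takes a sub-task description, returns a result string.
Agent = Callable[[str], str]


@dataclass
class SubTask:
    specialist: str   # key into the SPECIALISTS registry
    description: str  # what the specialist is asked to do


def research_agent(task: str) -> str:
    # Stand-in for an LLM call with a research-focused prompt and tool subset.
    return f"[research] notes on: {task}"


def coding_agent(task: str) -> str:
    # Stand-in for an LLM call with a coding-focused prompt and tool subset.
    return f"[code] patch for: {task}"


SPECIALISTS: Dict[str, Agent] = {
    "research": research_agent,
    "coding": coding_agent,
}


def plan(goal: str) -> List[SubTask]:
    # Stand-in for the coordinator's planning call (assumed, fixed two-step plan).
    return [
        SubTask("research", f"background for {goal}"),
        SubTask("coding", f"implementation of {goal}"),
    ]


def coordinate(goal: str) -> str:
    # Dispatch each sub-task to its specialist, then integrate the results.
    results = [SPECIALISTS[st.specialist](st.description) for st in plan(goal)]
    return "\n".join(results)


print(coordinate("rate limiter"))
```

Each specialist only ever sees its own sub-task description, which is the property the pattern depends on.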

Multi-agent shapes that ship in production

Coordinator + 2-4 specialists with focused tool subsets (default deep-research pattern)
Sequential draft / critique / polish chains for content and code generation
Plan / execute split — frontier model plans, cheaper model executes each step
Triage router — small classifier agent dispatches to specialist agents by domain
Verifier-on-top — an independent agent re-checks the primary’s answer before return
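
Two of the shapes above, the triage router and verifier-on-top, compose naturally. The sketch below is illustrative only: classify() uses keyword rules where a real system would use a small classifier model, the ROUTES table and verify() check are hypothetical, and the verifier here is a trivial non-empty test standing in for an independent agent.

```python
from typing import Callable, Dict

Agent = Callable[[str], str]


def classify(query: str) -> str:
    """Stand-in for a small classifier agent: keyword rules only."""
    q = query.lower()
    if any(word in q for word in ("refund", "invoice", "charge")):
        return "billing"
    if any(word in q for word in ("error", "crash", "traceback")):
        return "technical"
    return "general"


# Hypothetical specialists, one per domain.
ROUTES: Dict[str, Agent] = {
    "billing": lambda q: f"billing agent handled: {q}",
    "technical": lambda q: f"technical agent handled: {q}",
    "general": lambda q: f"general agent handled: {q}",
}


def verify(query: str, answer: str) -> bool:
    """Stand-in for an independent verifier agent (here: non-empty answer)."""
    return bool(answer.strip())


def handle(query: str) -> str:
    domain = classify(query)          # triage router picks the specialist
    answer = ROUTES[domain](query)    # specialist produces the answer
    if not verify(query, answer):     # verifier re-checks before return
        return "escalate: verifier rejected the answer"
    return answer


print(handle("I need a refund for my last invoice"))
```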

Multi-agent wins come mostly from context discipline and per-task model choice, not from “cognitive division of labor”.

Why specialization actually wins

The naive assumption is that multi-agent benefits come from “cognitive division of labor” — like a human team where each person thinks differently. That’s mostly not what happens with LLMs (especially same-family LLMs). The real wins:

Context discipline. Each specialist has a focused system prompt and a focused tool subset. Less context means less attention dilution and more reliable tool selection.
Per-agent model choice. A frontier LLM for the planner; cheaper specialized models for narrow sub-tasks. Cost falls dramatically without quality loss.
Independent reasoning paths. In debate or multi-sample setups, having different agents (different models, different prompts) gives uncorrelated error patterns. Aggregate beats the best individual.
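
The claim that the aggregate beats the best individual can be checked with a small calculation. The sketch below assumes a simplified model that is not in the original text: every answer is either right or wrong, each agent is right with the same probability p, errors are fully independent, and the final answer is a majority vote over an odd number of agents.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent agents is correct.

    Assumes binary right/wrong answers, identical per-agent accuracy p,
    and an odd n so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


for n in (1, 3, 5):
    print(n, round(majority_accuracy(0.7, n), 4))
```

Under these assumptions, three agents at 70% individual accuracy reach about 78% as a group and five reach about 84%. If the errors are perfectly correlated instead, every agent is wrong on the same inputs and the vote stays at 70%, which is why diversity of models and prompts matters.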

Where it breaks

Cost multiplication. N agents per turn = N× token spend, plus inter-agent message overhead. A 3-agent debate plus reflection is easily 8× a single-agent baseline.
Latency. Sequential coordination is slow; even parallel multi-agent has aggregation latency.
Coordination errors. The coordinator gets the sub-task description subtly wrong, the specialist solves the wrong problem, and the coordinator integrates the wrong-problem solution into the final answer. Hard to debug because each agent looks individually correct.
Same-model collusion. Multiple agents drawn from the same model often share blind spots — they confidently agree on the same wrong answer.
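
The cost multiplication above is easy to estimate before building anything. The model below is a deliberately crude assumption, not a measurement: every agent spends roughly one single-agent baseline of tokens per pass, debate rounds and reflection passes each count as one pass, and inter-agent message overhead is a flat ratio on top.

```python
def token_multiplier(
    n_agents: int,
    rounds: int = 1,
    reflection_passes: int = 0,
    overhead_ratio: float = 0.0,
) -> float:
    """Estimated token spend relative to one single-agent pass.

    overhead_ratio is the assumed extra fraction spent on inter-agent messages.
    """
    passes = rounds + reflection_passes
    return n_agents * passes * (1.0 + overhead_ratio)


# 3-agent debate: one round plus one reflection pass, no message overhead.
print(token_multiplier(3, rounds=1, reflection_passes=1))
# Same debate with two rounds.
print(token_multiplier(3, rounds=2, reflection_passes=1))
```

With these assumptions a 3-agent debate with reflection lands at 6× to 9× a single-agent baseline, which brackets the figure quoted above. Real numbers depend on prompt sizes and how much history each agent re-reads, so measured token counts should replace this estimate as soon as they exist.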

The crossover point is reliably earlier than teams expect. Published reports from production multi-agent deployments suggest the diminishing-returns inflection lands somewhere around 3-5 agents for most tasks, and the negative-returns inflection (more agents → worse outputs) around 8-10. Two failure modes drive this:

Coordinator confusion. The top-level agent’s context fills with sub-task summaries that increasingly resemble each other, and the integration step starts averaging across mediocre takes instead of picking the best.
Error amplification. A misread sub-task at the coordinator gets passed to a specialist who confidently solves the wrong problem, and downstream agents take that wrong solution as ground truth.

The pragmatic ceiling for a single coordinator-and-specialists tier is ~5 agents; deeper structures need explicit verification gates between layers.
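
One way to read "a ceiling of ~5 agents with verification gates between layers" is as two checks around each tier. The sketch below is an assumption about how that could look: MAX_SPECIALISTS, run_tier(), and the keyword-based gate are hypothetical names, and a real gate would be an independent verifier rather than a substring test.

```python
from typing import Callable, List

MAX_SPECIALISTS = 5  # the ~5-agent ceiling from the text, used here as a hard cap


def run_tier(
    subtasks: List[str],
    specialist: Callable[[str], str],
    gate: Callable[[str, str], bool],
) -> List[str]:
    """Run one coordinator tier and pass only gate-approved results upward."""
    if len(subtasks) > MAX_SPECIALISTS:
        raise ValueError(f"tier has {len(subtasks)} sub-tasks; cap is {MAX_SPECIALISTS}")
    accepted = []
    for task in subtasks:
        result = specialist(task)
        # Verification gate: drop results that do not address the sub-task,
        # so a wrong-problem solution is not treated as ground truth downstream.
        if gate(task, result):
            accepted.append(result)
    return accepted


def echo_specialist(task: str) -> str:
    # Stand-in for a specialist agent.
    return f"answer to {task}"


def keyword_gate(task: str, result: str) -> bool:
    # Stand-in for a verifier: the result must at least mention the sub-task.
    return task in result


print(run_tier(["pricing", "latency"], echo_specialist, keyword_gate))
```

Rejected results are dropped silently here for brevity; in practice a rejection would more likely trigger a retry or an escalation.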

The pragmatic stance

Most production multi-agent systems are “coordinator + 2-4 specialists, each with a focused tool set.” Debate, marketplace, and deep hierarchies are research-grade patterns that ship occasionally. What pays off: each agent gets a tight role, a focused context, and the cheapest model that can do its job. Frontier LLM as the coordinator; specialized models — including rerankers and context compressors — wherever the sub-task is narrow enough to swap in.

Go further

When does multi-agent beat a single agent with all the tools?

When sub-tasks have different optimal contexts (different system prompts, different tool subsets, different models). A single agent juggling 50 tools and conflicting personas underperforms a coordinator routing to specialists with focused contexts. The win is largely from context discipline, not from 'cognitive division of labor' the way humans do it.

Related concepts: Tool use, System prompt

Does debate actually improve answers?

Sometimes — on tasks where errors are uncorrelated across agents (different prompts, different models, different reasoning traces). On tasks where the errors are shared (same model, same training data), debate produces the same wrong answer with more confidence. Use diversity of agents intentionally.

Related concepts: Self-consistency, Reflection and critique

What's the failure mode that gets people every time?

Cost. Multi-agent systems multiply token spend by the number of agents involved per turn. A 3-agent debate with reflection is 6-9× the tokens of a single-agent baseline, and the quality lift is rarely 6-9× anything. Profile before deploying.

Related concepts: Agent loop, Context compression
