Understanding agent orchestration matters because it changes what's possible.
You’ve probably been using coding agents the way most developers do: open Claude Code or Cursor, describe what you want, watch it work, course-correct when it drifts. One task at a time, one conversation at a time, you in the driver’s seat.
That works. But it makes you the bottleneck.
Over the past year, a wave of tools and projects has promised to change this. Devin, OpenHands, Claude Code’s background tasks, Cursor’s background agents, OpenAI Codex — all positioning themselves as “autonomous” developer tools. The hype makes them sound revolutionary. The reality is more interesting and more useful: they are all different forms of the same thing. They are orchestrators — systems that structure how an LLM agent is invoked, given context, checked for correctness, and recovered from failure. They differ in how they handle those concerns, and those differences explain why some work and others don’t.
Understanding orchestration matters not because you need to build your own framework, but because it changes what’s possible. Manual mode means you supervise every step. Orchestration means agents work while you sleep, in parallel, on tasks you’d never sit through one by one. It’s not a speedup — it’s a qualitative shift in what a single developer can accomplish.
This post walks through three concrete approaches to orchestration, extracts the key design decisions they make differently, and gives you a practical path to start applying these ideas.
Three ways to orchestrate a coding agent
The easiest way to understand orchestration is to watch the same task handled three different ways and see what breaks at each level.
The task: add authentication to a web app. Create the user model, add login and signup routes, wire up JWT token handling, add auth middleware, and update the tests.
Approach 1: Just tell the agent everything at once
This is what most developers do today. You open a session with your coding agent and type something like:
“Add authentication to this app. Create a User model with email and password fields, add login and signup API routes, implement JWT token generation and validation, add auth middleware that protects existing routes, and update the test suite to cover all new functionality.”
The agent starts working. It creates the User model, moves to the routes, gets the JWT logic mostly right. But somewhere around the middleware step, things start to go sideways. Maybe the context window is getting full and the agent loses track of the exact middleware requirements. Maybe it decides the tests aren’t important and skips them. Maybe your session crashes — and when you restart, the agent has no memory of what it already did or what’s left.
This is context-window orchestration: the agent’s working memory is the only place the plan exists. There is no external record of what needs to happen, what already happened, or what went wrong.
The failure modes are predictable:
- Context overflow. Complex tasks fill the context window. Auto-compaction kicks in and silently rewrites or drops parts of your original instructions. The agent continues confidently, working from a corrupted plan.
- Selective execution. The agent decides some steps are unnecessary or “already handled” when they aren’t. You won’t notice until you look at the result.
- Crash = total loss. If the session dies, the plan dies with it. You restart from zero, re-explain everything, and hope the agent doesn’t make different mistakes this time.
- No checkpoint. There’s no clean boundary between “step 3 is done” and “step 4 is starting.” A failure mid-step leaves the codebase in a half-mutated state that’s hard to reason about.
This approach is fine for small, single-step tasks. It breaks down the moment complexity or duration increases.
Approach 2: Put the plan in a file and loop
This is the idea behind the “Ralph” technique, named by developer Geoffrey Huntley after using it to build an entire programming language with a coding agent.
The key insight: take the plan out of the agent’s head and put it in a file.
You create a plan.md that describes exactly what needs to be done — the same auth task, but written as a checklist with specific acceptance criteria for each step. Then you run the coding agent in a loop: each iteration, the agent reads the plan file, picks the next unchecked item, does the work, commits the result to git, and checks the box. If the agent crashes or makes a mess, you reset to the last commit and the next iteration picks up where things left off.
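Written out, a plan file for the auth task might look something like this (the checklist structure is the point; the specific acceptance criteria are illustrative):

```markdown
# Plan: add authentication

- [ ] Create User model with email and password fields
  - Done when: email has a unique index; passwords are stored hashed, never plaintext
- [ ] Add signup and login API routes
  - Done when: POST /signup creates a user; POST /login returns a JWT for valid credentials
- [ ] Implement JWT generation and validation
  - Done when: tokens expire; tokens with invalid signatures are rejected
- [ ] Add auth middleware protecting existing routes
  - Done when: requests without a valid token receive 401
- [ ] Update the test suite
  - Done when: npm test passes and covers every item above
```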
What changes:
- The plan survives the agent. Context window overflow, auto-compaction, crashes — none of them touch the plan file. It persists on disk.
- Each iteration starts clean. A fresh context window every time. No accumulated confusion, no compaction artifacts. The agent reads the current plan, sees what’s done and what isn’t, and focuses on one thing.
- Git as a safety net. Each successful step is committed. A failed step is rolled back via git reset --hard. The codebase is never left half-mutated.
- The human can steer between iterations. You can edit the plan file while the loop runs — reprioritise, add detail, remove steps that turned out to be unnecessary.
This is a real improvement. The plan is persistent, iterations are transactional (succeed and commit, or fail and roll back), and a crash costs you one iteration instead of everything. Huntley used this approach to run agents overnight, unsupervised, producing reviewable branches of committed work by morning.
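The loop itself is small enough to sketch. This is a minimal Python version, assuming a hypothetical my-agent CLI for the coding agent (substitute whatever tool you actually use):

```python
import re
import subprocess
from pathlib import Path

def next_unchecked(plan_text: str):
    """Return the first '- [ ]' item in the plan, or None when all are done."""
    m = re.search(r"^- \[ \] (.+)$", plan_text, re.MULTILINE)
    return m.group(1) if m else None

def mark_done(plan_text: str, item: str) -> str:
    """Flip one checkbox from '[ ]' to '[x]' in the plan text."""
    return plan_text.replace(f"- [ ] {item}", f"- [x] {item}", 1)

def run_agent(item: str) -> bool:
    """Invoke the coding agent on exactly one step.
    'my-agent' is a placeholder, not a real CLI."""
    result = subprocess.run(
        ["my-agent", "--prompt", f"Do exactly this step, nothing else: {item}"]
    )
    return result.returncode == 0

def ralph_loop(plan_path: Path) -> None:
    # Each pass starts from a fresh context: re-read the plan from disk.
    while (item := next_unchecked(plan_path.read_text())) is not None:
        if run_agent(item):
            plan_path.write_text(mark_done(plan_path.read_text(), item))
            subprocess.run(["git", "add", "-A"])
            subprocess.run(["git", "commit", "-m", f"agent: {item}"])
        else:
            # Failed iteration: roll back to the last good commit.
            subprocess.run(["git", "reset", "--hard"])
```

Because every iteration re-reads the plan from disk, a crash costs at most one iteration: the next run simply picks up at the first unchecked box.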
But there’s a gap, and it’s deeper than self-evaluation. In the Ralph loop, everything — including critical checks like “did the full test suite pass?” — happens because the prompt tells the agent to do it. The agent can decide not to. Or it can run the tests, see failures, and rationalise them as unrelated to its changes. Anything you want guaranteed to happen has to be executed by the orchestrator itself, not by instructions to the model. Checks run by code always run. Checks run by prompt are suggestions. A wrapper script that runs npm test after every iteration and resets on failure catches broken tests every time; a prompt that says “make sure tests pass” may or may not.
This is why the next step isn’t just adding a reviewer — it’s moving critical logic out of the prompt and into the orchestrator itself.
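The difference is easy to make concrete. Here is a sketch of an orchestrator-enforced check, with the command runner injected so the guarantee lives in code rather than in the prompt (the specific commands are illustrative, assuming an npm project):

```python
import subprocess

def verified_iteration(run_agent_step, run=subprocess.run) -> bool:
    """Run one agent step, then enforce the test suite in code, not prose.
    The agent cannot skip or rationalise this check, because it never
    sees it; the runner is injected so the wrapper itself is testable."""
    run_agent_step()
    if run(["npm", "test"]).returncode != 0:
        # A failing suite always resets -- no exceptions, no judgment calls.
        run(["git", "reset", "--hard"])
        return False
    run(["git", "add", "-A"])
    run(["git", "commit", "-m", "agent step (tests green)"])
    return True
```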
Approach 3: Separate the worker from the judge
This is the conceptual step that systems like StrongDM’s Attractor take. The idea, stripped to its core: don’t let the same agent that does the work decide whether the work is correct — or whether the check runs at all.
In the auth example: after the agent implements JWT middleware, a separate verification step checks whether the middleware actually rejects unauthenticated requests, handles expired tokens correctly, and returns the right error codes. This check isn’t written by the same agent that wrote the middleware. It comes from a pre-defined set of scenarios — concrete end-to-end test cases that describe what a working auth system looks like from the user’s perspective.
The scenarios are stored separately from the codebase. The agent can’t see them, can’t modify them, and can’t game them. They function like a holdout set in machine learning: the thing being optimised never gets to read the test it’s being graded on.
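A holdout scenario runner can be sketched in a few lines. The scenario format and the injected request client are illustrative, not any particular tool's API:

```python
import json
from pathlib import Path

def load_scenarios(path: str) -> list:
    """Scenarios live outside the repo the agent works in (the path is up
    to you), so the agent can neither read nor game them."""
    return json.loads(Path(path).read_text())

def run_scenarios(scenarios: list, send) -> tuple:
    """Evaluate every scenario. `send(method, route, token)` performs one
    request against the app under test and returns an HTTP status code;
    inject a real HTTP client in production."""
    passed = sum(
        1 for s in scenarios
        if send(s["method"], s["route"], s.get("token")) == s["expect_status"]
    )
    return passed, len(scenarios)
```

The result is a pass count per iteration rather than a self-reported "done" from the agent.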
What changes:
- Verification is independent. The thing that judges “did this work?” is separate from the thing that did the work. This catches “success but wrong” — the most common and most insidious failure mode with LLM agents.
- Quality becomes measurable. Instead of “the agent said it’s done,” you get “14 of 17 scenarios pass.” That’s a number you can track across iterations, compare across approaches, and use to decide whether to ship.
- The human’s job shifts. You stop reviewing code line-by-line and start writing scenarios — descriptions of what correct behaviour looks like. This is a higher-leverage activity: one good scenario catches a class of bugs, while one line-by-line review catches one instance.
This is harder to set up. You need to write meaningful scenarios before the agent starts working, and the scenarios need to be specific enough to catch real problems without being so brittle that they break on irrelevant details. But when it works, it’s the difference between “I hope the agent got it right” and “I can prove the agent got it right.”
What you lose
Going from manual to autonomous isn’t free, and pretending otherwise would be dishonest.
In manual mode, you see every decision the agent makes. You catch the bad ones in real time. You bring judgment that no automated check can replicate — you know which trade-offs are acceptable, which corners can be cut, which “passing tests” are actually testing the wrong thing.
When you hand that over to an orchestrator, you lose fine-grained control and real-time judgment. You gain throughput and parallelism. The design challenge — really, the only challenge that matters — is making the automated verification good enough to compensate for the loss of human eyes on every step. The three approaches above are points on that spectrum: context-window orchestration barely tries; the loop pattern adds structural safety nets; scenario-based verification attempts to replace human judgment with something independently trustworthy.
No approach fully replaces a skilled developer paying attention. The bet is that one developer paying attention to the right things (specs, scenarios, architecture) and delegating the rest to orchestrated agents produces more than one developer writing every line by hand.
Five questions that define every orchestrator
The three approaches above look different on the surface, but they’re all making the same set of design decisions — just answering them differently. If you understand these five questions, you can evaluate any orchestration tool or technique, including ones that haven’t been built yet.
1. Where does the plan live?
This is the most basic question and the one with the biggest impact on reliability.
- In the agent’s context window. This is approach 1. The plan exists only as tokens in a conversation. It’s convenient but fragile — compaction can rewrite it, crashes destroy it, and there’s no external record of what the agent was supposed to do.
- In a file on disk. This is approach 2 (Ralph). The plan persists independently of the agent process. Crashes, context resets, even switching to a different agent — the plan survives. This is the single biggest reliability upgrade most developers can make.
- In a durable workflow engine. This is where production systems are heading. Tools like Temporal record every step of a workflow so that a crash is recovered automatically — the engine replays from the last checkpoint without human intervention. OpenAI’s Codex and Replit’s Agent both run on Temporal in production for exactly this reason.
The pattern: the further the plan lives from the agent’s ephemeral context, the more resilient the system is.
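The durable-engine idea can be shown in miniature: an append-only journal that survives crashes and is replayed on startup to recover which steps completed. This is a toy illustration of the concept, not Temporal's actual API:

```python
import json
from pathlib import Path

def record(journal: Path, event: dict) -> None:
    """Append one event to a durable, append-only log on disk."""
    with journal.open("a") as f:
        f.write(json.dumps(event) + "\n")

def completed_steps(journal: Path) -> set:
    """Replay the log to recover state after a crash. This is the same
    idea, in miniature, that workflow engines implement at scale."""
    if not journal.exists():
        return set()
    events = [json.loads(line) for line in journal.read_text().splitlines()]
    return {e["step"] for e in events if e["type"] == "step_done"}
```

A crashed orchestrator restarts, calls completed_steps, and skips everything already done, with no human intervention and no lost work.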
2. Who writes the plan?
- You write the full plan. You specify every step, every acceptance criterion, every constraint. The agent is a pure executor. This gives maximum control and maximum spec-satisfaction, but it’s labour-intensive. It’s what Huntley does with Ralph.
- The agent writes the plan from a goal. You say “add auth” and the agent figures out the steps. This is flexible but risky — the agent’s plan may miss requirements you assumed were obvious.
- Hybrid. You write the high-level requirements and acceptance criteria; the agent decomposes them into steps. This is the sweet spot for most real work. You define what and how good; the agent figures out how.
For spec-satisfaction specifically: the more precisely the human defines “done,” the more likely the output matches what was wanted. Agents are good at executing well-specified plans; they’re unreliable at inferring unstated requirements.
3. How do you know it worked?
This is the question that determines output quality more than any other.
- The agent says so. The weakest signal. The agent declares it’s done, and you trust it. This is what context-window orchestration defaults to.
- Tests pass. Better. A concrete check runs after each step. But if the agent wrote both the code and the tests, the check is self-referential — the agent is grading its own homework.
- Something independent checks. Best. Scenarios written before the agent starts, stored separately, evaluated by a different process. This is the only form of verification that reliably catches “success but wrong.”
There’s a principle here worth stating explicitly: as human oversight decreases, verification must get stronger. In manual mode, you are the verification. You read every diff, you run the app, you exercise the edge cases. When you step away and let an orchestrator run, something must fill that role. If nothing does, quality drops — not immediately, but inevitably.
This is why many autonomous coding projects produce code that looks right but isn’t. The orchestrator runs, the agent reports success, tests pass — but the tests don’t cover the cases that matter, and nobody checked.
4. What happens when it breaks?
Every agent will eventually fail. The question is what the system does next.
- Start over. Context-window orchestration’s default. The session crashes, you restart from scratch. Expensive.
- Retry the step. The loop pattern’s answer. Roll back the last commit, try again from the last known good state. Cheap, and it works surprisingly often because LLM outputs are non-deterministic — the same prompt may succeed on the second attempt.
- Compensate and continue. More sophisticated systems can undo the effects of a failed step (revert a database migration, cancel an API call) and try an alternative path. This is where workflow engines like Temporal shine — they track what happened and can unwind it.
- Escalate. When automated recovery fails, surface the problem to a human with enough context to fix it. This requires good observability (see question 5).
The practical takeaway: even a simple retry-on-failure mechanism (the Ralph loop’s git reset --hard on error) is dramatically better than nothing. Most developers skip this entirely and lose hours to avoidable restarts.
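The retry-then-escalate shape looks like this in a sketch, with the step, recovery, and escalation hooks all injected (the names are illustrative):

```python
def with_retries(step, recover, escalate, max_attempts=3) -> bool:
    """Run a step with bounded retries. Because LLM output is
    non-deterministic, the same prompt often succeeds on a later attempt;
    a human is paged only after automated recovery is exhausted."""
    for _ in range(max_attempts):
        if step():
            return True
        recover()  # e.g. git reset --hard back to the last good commit
    escalate(f"step still failing after {max_attempts} attempts")
    return False
```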
5. When do you step in?
This is ultimately a trust question, and the honest answer is: it depends on how strong your answers to questions 1-4 are.
- Every step. This is manual mode. Maximum quality, minimum throughput.
- At defined checkpoints. The agent works autonomously through a block of steps, then pauses for your review before continuing. A good middle ground when you’re building trust in a new workflow.
- Only at the end. The agent completes the full task; you review the final result. This works when you have strong verification (question 3) and good recovery (question 4).
- Only when the agent is stuck. The agent runs indefinitely and flags you when it hits a problem it can’t resolve. This requires the agent to know what it doesn’t know — a hard problem.
- Never. This is the endpoint some teams are exploring — fully autonomous with scenario-based verification as the only quality gate. It’s achievable for well-specified, well-tested domains. It’s reckless for everything else.
Most developers should start somewhere in the middle and move toward less oversight gradually as they build confidence in their verification and recovery mechanisms. Jumping straight to “never” is how projects join the 40% that get cancelled.
Applying this in practice
Understanding these concepts is necessary but not sufficient. The gap between “I understand orchestration” and “I can orchestrate effectively” is filled by practice, and the practice has to be structured or you’ll learn the wrong lessons.
Start where failure is cheap. Your first orchestrated agent run should not be on your production codebase. Pick a side project, a throwaway experiment, a greenfield feature branch. Something where a bad result costs you an afternoon, not a release.
Add one concept at a time. The most common mistake is jumping from manual mode to a full orchestration setup in one step. When it breaks — and it will — you won’t know whether the problem is your plan file format, your loop logic, your verification, or the agent itself. Instead:
- First, try just externalising the plan. Write a plan.md, give it to the agent at the start of each session, and manually check the boxes. This alone will improve your results because it forces you to think through the steps before the agent starts.
- Once that feels natural, add the loop. Run the agent repeatedly against the plan file, with git commits as checkpoints. See how far it gets unsupervised.
- Once the loop is stable, add verification. Write scenarios that define “done” independently of the agent’s own judgment. See how often the agent’s “done” matches your scenarios’ “done.”
Each step teaches you something that reading about it cannot. The plan file teaches you how much specification agents actually need (more than you think). The loop teaches you how agents fail (usually by confidently doing the wrong thing, not by crashing). Verification teaches you what “correct” actually means in your domain (usually more nuanced than your test suite captures).
Don’t over-invest in tooling before you understand the concepts. You don’t need LangGraph, Temporal, or a custom framework to start. A markdown file and a bash while-loop will teach you 80% of what matters. Upgrade to more sophisticated tools when you hit a specific limitation, not because the tool looks impressive.
Write better specs, not more code. The biggest leverage shift in autonomous orchestration is that the human’s job moves from writing code to writing specifications. A precise spec with clear acceptance criteria will produce better autonomous results than a vague spec with a sophisticated orchestrator. Invest your time accordingly.
Where this is heading
Three trends are converging.
Orchestration is becoming declarative. Instead of writing code that tells an agent what to do step by step, developers are writing specifications that describe the desired outcome and letting orchestration engines figure out the execution plan. This is the trajectory from bash loops to workflow graphs to spec-first tools.
Durability is becoming infrastructure. The “what happens when it crashes” problem is being absorbed by platform-level tools. Temporal and similar engines make crash recovery automatic and invisible. Within a year, “durable agent execution” will be a checkbox feature, not an engineering project.
Verification is becoming the bottleneck. As orchestration and durability become commoditised, the hard problem shifts to: how do you know the output is correct? Writing good scenarios, defining meaningful acceptance criteria, building verification that catches “success but wrong” — this is where the real skill gap will be. The developers who can specify what “correct” means precisely enough for automated checking will get dramatically more leverage from autonomous agents than those who can’t.
The common thread: the value is shifting from writing code to specifying outcomes. The agents will write the code. Your job is to define what good looks like, clearly enough that a machine can check it.
That’s not a future prediction. It’s already happening. The question is whether you start learning now, one concept at a time, or later, when the gap is wider.