Post-Mortem ⏱️ 5 min read

The Eight-Hour Silence: When Orchestra Stopped Spawning

TL;DR

🔇 The symptom: An Orchestra run spawned workers cleanly for eight hours, then every subsequent fan-out silently failed.
🔍 The forensics: Briefs were written, but the shell scripts that launch the workers never appeared on disk. No process, no result file, no error.
🐛 The cause: Four bugs in one trench coat, each layered over the next so the symptom looked like nothing at all.
The fix: Make the spawn loop transactional, defer instead of die under load, lock out concurrent narrators, and trap shell-level failures with a loud sentinel file.

📊 Build Stats

| Metric | Value |
| --- | --- |
| Skill | one-shot-orchestra |
| Version | 0.25.2 → 0.26.0 (runner 1.11.2 → 1.12.0) |
| Files changed | 12 (3 new, 9 modified) |
| Lines | +1,031 / −48 |
| Verification tests | 8 (run-lock, spawn trap, throttle bypass, concurrent-narrator block, transactional fan-out, JS syntax, shell syntax) |
| Commit | aa32365d |

🔇 The Eight-Hour Silence

Picture a long run on Orchestra: a migration job, eleven phases, three or four parallel workers per phase. For eight hours every spawn is clean.

Diagnose, Recon, Research times three, Planner, Builder times four, Merger. The chat log fills with green checkmarks.

Then the next phase starts, and nothing happens.

No claude.exe in Task Manager. No worker windows opening.

The narrator session keeps polling and waiting, polling and waiting. Eventually it gives up and declares the run partial-unverified.

The user looks at the chat log and asks the obvious question: where are the workers?

Core insight: The worst kind of automation failure is the one that looks like a long task. If a system can wait silently forever, sooner or later it will, and the user will only notice when they check back hours later.

🔍 Forensics on Disk

Orchestra leaves a paper trail. Every worker spawn is supposed to write three files in order: a brief-NAME.md (the prompt), a .prompt-NAME.txt (the wrapped command), and a .run-NAME.sh (the shell script that actually launches the worker process).

For the workers that vanished, only the brief existed. The other two files, and the worker process itself, never appeared.

📄 brief-Test-1.md ✓ written

📝 .prompt-Test-1.txt ✗ missing

🐚 .run-Test-1.sh ✗ missing

💀 No claude.exe process. No result file. No error. No log line. Nothing.

That gap, between "brief written" and "shell script written", was where the spawn pipeline died. Whatever broke, it broke quietly enough that not even the watchdog noticed.
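The three-file paper trail lends itself to a quick audit. The following is a minimal sketch, not Orchestra's own tooling: `auditSpawnArtifacts` and the `STAGES` table are hypothetical names, and the only details taken from the post are the three file name patterns and the order they are written in.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// The three spawn artifacts, in the order the pipeline is supposed to write them.
const STAGES = [
  { stage: 'brief', fileFor: (name) => `brief-${name}.md` },
  { stage: 'prompt', fileFor: (name) => `.prompt-${name}.txt` },
  { stage: 'run-script', fileFor: (name) => `.run-${name}.sh` },
];

// Hypothetical helper: report which artifacts exist for one worker and name
// the first stage whose file is missing (null when all three are present).
function auditSpawnArtifacts(runDir, name) {
  const present = {};
  let firstMissing = null;
  for (const { stage, fileFor } of STAGES) {
    present[stage] = fs.existsSync(path.join(runDir, fileFor(name)));
    if (!present[stage] && firstMissing === null) firstMissing = stage;
  }
  return { name, present, firstMissing };
}

// Demo: recreate the vanished-worker layout (brief only) in a temp directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'orchestra-audit-'));
fs.writeFileSync(path.join(demoDir, 'brief-Test-1.md'), '# brief\n');
const audit = auditSpawnArtifacts(demoDir, 'Test-1');
console.log(audit.firstMissing); // prints "prompt"
fs.rmSync(demoDir, { recursive: true, force: true });
```

Run against a real run directory, a check like this would point straight at the stage where the pipeline stopped, instead of leaving that to be inferred from a missing process.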

🎯 Four Bugs in One Trench Coat

The investigation turned up four separate bugs, each one masking the others, and each striking a different stage of the spawn pipeline.

[Diagram: spawn pipeline, with bug numbers marking where each one strikes. narrator (orchestra spawn) [3] → JS spawn loop (spawn.js) [1] → throttle check (spawn-shell.js) [2] → bash launcher (orchestra-spawn.sh) [4]]

🔧 Fix #1: A Loop That Survives Mid-Step Failure

The fan-out spawn loop iterated through N workers. If the third one threw, the loop aborted, workers four and five never got a chance, and the state file was never written.

The fix is simple in shape: each iteration runs in its own try/catch. Failures fall into three buckets (spawned, deferred, failed), and the state always gets written at the end with all three lists.

// for each worker N in the fan-out batch
let result;
try {
  result = spawnWorker({ runDir, name, briefPath });
} catch (err) {
  failed.push({ name, reason: err.message });
  continue;
}
if (result?.deferred) {
  deferred.push({ name, reason: result.reason });
  continue;
}
spawned.push(name);

One bad worker no longer takes the others down with it. The narrator gets a complete picture: who launched, who got pushed back, who failed outright.
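To see the three buckets and the unconditional state write working together, here is a self-contained sketch of the whole loop. It rests on assumptions: `fanOut`, the injected `stubSpawnWorker`, and the `state.json` file name are all hypothetical, chosen only so the example can run without Orchestra.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical wrapper around the per-worker loop. spawnWorker is injected
// so the sketch can run against a stub instead of a real launcher.
function fanOut({ runDir, workers, spawnWorker }) {
  const spawned = [];
  const deferred = [];
  const failed = [];
  try {
    for (const { name, briefPath } of workers) {
      let result;
      try {
        result = spawnWorker({ runDir, name, briefPath });
      } catch (err) {
        failed.push({ name, reason: err.message }); // hard failure: record, move on
        continue;
      }
      if (result?.deferred) {
        deferred.push({ name, reason: result.reason }); // throttled: retry later
        continue;
      }
      spawned.push(name);
    }
  } finally {
    // Written even if something outside the per-worker try/catch throws.
    // The file name state.json is an assumption.
    const state = JSON.stringify({ spawned, deferred, failed }, null, 2);
    fs.writeFileSync(path.join(runDir, 'state.json'), state);
  }
  return { spawned, deferred, failed };
}

// Stub launcher: W3 is throttled, W4 fails hard, the rest launch cleanly.
function stubSpawnWorker({ name }) {
  if (name === 'W3') return { deferred: true, reason: 'throttled' };
  if (name === 'W4') throw new Error('launcher failed');
  return { deferred: false };
}

const runDir = fs.mkdtempSync(path.join(os.tmpdir(), 'orchestra-fanout-'));
const workers = ['W1', 'W2', 'W3', 'W4', 'W5'].map((name) => ({
  name,
  briefPath: path.join(runDir, `brief-${name}.md`),
}));
const outcome = fanOut({ runDir, workers, spawnWorker: stubSpawnWorker });
const onDisk = JSON.parse(fs.readFileSync(path.join(runDir, 'state.json'), 'utf8'));
console.log(outcome.spawned); // [ 'W1', 'W2', 'W5' ]
fs.rmSync(runDir, { recursive: true, force: true });
```

The `finally` block is the part that makes the loop transactional in the sense used here: whatever happens inside, the state file ends up describing every worker that was attempted.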

🌊 Fix #2: Defer, Don't Die

Orchestra has a pre-flight throttle that checks CPU, RAM, claude.exe count, disk queue, and a few other signals. When the host is under pressure, the throttle blocks the spawn so the machine doesn't tip over.

The old behaviour: throw an exception. Combined with bug one, that exception killed the whole batch.

The new behaviour: write a throttled-result file, return { deferred: true, reason }, and let the caller decide. The caller (now transactional from fix one) drops the worker into the deferred bucket. The narrator's next orchestra spawn call picks up the deferred workers once the host clears.

Rule of thumb: Transient pressure should produce a deferred state, not an exception. Exceptions in concurrent code propagate in unhelpful ways. A flag that says "try me again later" is cooperative.
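A sketch of the defer-instead-of-throw shape follows. Everything specific in it is an assumption made for illustration: the `LIMITS` thresholds, the `signals` object, the `throttled-NAME.json` file name, and the injected `launch` callback. Only the `{ deferred: true, reason }` return shape comes from the description above.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical thresholds; the real throttle checks more signals than these.
const LIMITS = { maxCpuPercent: 90, minFreeRamMb: 1024, maxClaudeProcesses: 8 };

// Pure check: a reason string when the host is under pressure, otherwise null.
function throttleReason(signals, limits = LIMITS) {
  if (signals.cpuPercent > limits.maxCpuPercent) return `cpu at ${signals.cpuPercent}%`;
  if (signals.freeRamMb < limits.minFreeRamMb) return `free RAM at ${signals.freeRamMb} MB`;
  if (signals.claudeProcesses >= limits.maxClaudeProcesses) {
    return `${signals.claudeProcesses} claude.exe processes running`;
  }
  return null;
}

// Under pressure: leave a record on disk and return a flag instead of throwing.
function spawnWorker({ runDir, name, signals, launch }) {
  const reason = throttleReason(signals);
  if (reason !== null) {
    const record = { status: 'throttled', name, reason };
    // The file name throttled-NAME.json is an assumption.
    fs.writeFileSync(path.join(runDir, `throttled-${name}.json`), JSON.stringify(record));
    return { deferred: true, reason };
  }
  launch(name);
  return { deferred: false };
}

// Demo: the same worker is deferred while the host is busy, launched once it clears.
const runDir = fs.mkdtempSync(path.join(os.tmpdir(), 'orchestra-throttle-'));
const launched = [];
const launch = (name) => launched.push(name);
const busy = spawnWorker({
  runDir, name: 'Test-1', launch,
  signals: { cpuPercent: 97, freeRamMb: 4096, claudeProcesses: 3 },
});
const calm = spawnWorker({
  runDir, name: 'Test-1', launch,
  signals: { cpuPercent: 35, freeRamMb: 4096, claudeProcesses: 3 },
});
console.log(busy, calm, launched);
fs.rmSync(runDir, { recursive: true, force: true });
```

Keeping the pressure check a pure function of its inputs means it can be exercised with made-up numbers, which is how a throttle-bypass test could be written without loading the machine.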

🔒 Fix #3: One Conductor at a Time

The May 3 chat log confirmed the third bug: when the original narrator hit a context limit, a handover spawned a second narrator session. Both sessions were calling orchestra spawn on the same run-id, racing on the briefs directory, the state file, and the per-worker shell scripts.

Two conductors, one orchestra, predictable result. The fix is a per-run lock at <run-dir>/.narrator.lock: an atomic mkdir with the holder's PID inside it. Every mutating subcommand acquires the lock, runs, and releases.

Do

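Use an atomic mkdir as the lock primitive. mkdir is atomic on Windows and POSIX: of two simultaneous callers, exactly one succeeds and the other gets an "already exists" error. Stale-PID detection via process.kill(pid, 0) handles the case where a holder died abnormally.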

Don't

Use a plain file with read-modify-write. There is always a window where two writers see no lock and both create one.

If a second narrator hits a locked run, it gets a clear envelope back: narrator_already_active, the holder's PID, and a hint to run orchestra unlock only if you're sure the holder is gone.
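The lock described above can be sketched in a few dozen lines. This is an illustration under assumptions, not the shipped code: the function names, the `pid` file name inside the lock directory, and the exact envelope fields are hypothetical. The mkdir primitive, the `process.kill(pid, 0)` liveness check, and the `narrator_already_active` code come from the text.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Signal 0 tests for existence without delivering a signal. EPERM means the
// process exists but belongs to someone else, so it still counts as alive.
function isAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === 'EPERM';
  }
}

function acquireNarratorLock(runDir, pid = process.pid) {
  const lockDir = path.join(runDir, '.narrator.lock');
  const pidFile = path.join(lockDir, 'pid'); // file name is an assumption
  for (let attempt = 0; attempt < 2; attempt += 1) {
    try {
      fs.mkdirSync(lockDir); // throws EEXIST if another holder got there first
      fs.writeFileSync(pidFile, String(pid));
      return { ok: true, holderPid: pid };
    } catch (err) {
      if (err.code !== 'EEXIST') throw err;
    }
    let holderPid = null;
    try {
      holderPid = Number.parseInt(fs.readFileSync(pidFile, 'utf8'), 10);
    } catch {
      // Lock directory exists but the holder has not written its PID yet.
    }
    const provablyDead = Number.isInteger(holderPid) && holderPid > 0 && !isAlive(holderPid);
    if (!provablyDead) {
      return { ok: false, error: 'narrator_already_active', holderPid };
    }
    fs.rmSync(lockDir, { recursive: true, force: true }); // stale: reclaim, retry once
  }
  return { ok: false, error: 'narrator_already_active', holderPid: null };
}

function releaseNarratorLock(runDir) {
  fs.rmSync(path.join(runDir, '.narrator.lock'), { recursive: true, force: true });
}

// Demo: first caller wins, second is refused, a stale lock is reclaimed.
const runDir = fs.mkdtempSync(path.join(os.tmpdir(), 'orchestra-lock-'));
const first = acquireNarratorLock(runDir);
const second = acquireNarratorLock(runDir);
releaseNarratorLock(runDir);

// Simulate a holder that died abnormally. This PID is assumed not to exist;
// it lies beyond the PID range on common systems.
const deadPid = 2 ** 31 - 1;
fs.mkdirSync(path.join(runDir, '.narrator.lock'));
fs.writeFileSync(path.join(runDir, '.narrator.lock', 'pid'), String(deadPid));
const third = acquireNarratorLock(runDir);
console.log(first.ok, second.error, third.ok);
fs.rmSync(runDir, { recursive: true, force: true });
```

One caveat about this sketch: reclaiming a stale lock is itself a remove-then-create sequence, so two contenders that both find the same dead holder could still race. A production version would need to close that window, for example by renaming the stale directory aside before retrying.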

📢 Fix #4: Fail Loud or Don't Fail At All

The bash script that launches each worker had set -euo pipefail at the top, so any failure inside it made the script exit on the spot. That exit was silent because the JS caller spawned bash with stdio: 'ignore' and detached: true, then called child.unref(), which sent the exit code and any error output into the void.

Now the script installs a trap on ERR and EXIT. If anything inside the launcher fails before the worker actually launches, the trap fires and writes two files: a .spawn-NAME.error sentinel with the line number and the failing command, plus a loud result-NAME.json with status "failed" so the existing detection path picks it up immediately.

Status reads now check for the sentinel at the top of every poll. A spawn-time failure that used to take ten minutes (the heartbeat-timeout window) to surface now surfaces in seconds.
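On the JS side, the sentinel-first status read might look like the sketch below. The function names, the `source` field, the `pending` and `done` status values, and the sentinel's text format are assumptions; the `.spawn-NAME.error` and `result-NAME.json` file names and the "failed" status come from the description above.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Read a file if it exists; a missing file is an expected case, not an error.
function readIfExists(filePath) {
  try {
    return fs.readFileSync(filePath, 'utf8');
  } catch (err) {
    if (err.code === 'ENOENT') return null;
    throw err;
  }
}

// Hypothetical status read: the spawn sentinel is checked before anything else,
// so a launcher failure surfaces on the very next poll.
function readWorkerStatus(runDir, name) {
  const sentinel = readIfExists(path.join(runDir, `.spawn-${name}.error`));
  if (sentinel !== null) {
    return { status: 'failed', source: 'spawn-sentinel', detail: sentinel.trim() };
  }
  const result = readIfExists(path.join(runDir, `result-${name}.json`));
  if (result !== null) {
    return { status: JSON.parse(result).status, source: 'result-file' };
  }
  return { status: 'pending', source: 'none' };
}

// Demo: one worker failed at spawn time, one finished, one is still running.
const runDir = fs.mkdtempSync(path.join(os.tmpdir(), 'orchestra-status-'));
// The sentinel's contents here are an illustrative format only.
fs.writeFileSync(path.join(runDir, '.spawn-Test-1.error'), 'line 12: some-command\n');
fs.writeFileSync(path.join(runDir, 'result-Test-1.json'), JSON.stringify({ status: 'failed' }));
fs.writeFileSync(path.join(runDir, 'result-Test-2.json'), JSON.stringify({ status: 'done' }));
const statuses = ['Test-1', 'Test-2', 'Test-3'].map((name) => readWorkerStatus(runDir, name));
console.log(statuses);
fs.rmSync(runDir, { recursive: true, force: true });
```

Checking the sentinel first is what shortens detection time: the poll no longer has to wait for a heartbeat to lapse before it can report that the launcher never got as far as starting the worker.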

🎬 Before vs After

Consider a five-worker fan-out under realistic stress: worker three hits the throttle, worker four hits a hard error.

Old behaviour fails the whole batch on worker three. New behaviour completes with three spawned, one deferred, one logged loudly.

🏗️ The Lesson

Think of a factory production line. Each station has a green light when it's working and a red one when it's stopped, so the floor manager can spot a stalled station from across the building.

What broke on May 3 was the equivalent of a station whose worker walked off shift mid-task, but the green light stayed on. The line just kept feeding work into the dark, and the manager only noticed eight hours later when the warehouse next door called to ask why the conveyor was empty.

The takeaway: Reliable automation needs three things in concert. Surface errors fast (so the green light goes red the moment the worker leaves), recover gracefully from transient pressure (so the line pauses without crashing), and prevent two operators from racing the same job (so two managers can't both think they're running the floor).

The fix shipped as Orchestra 0.26.0 and is live in Godmode and the Ultimate Bundle. Existing runs benefit on the next orchestra spawn call, so if you're driving a long Orchestra run right now, you already have it.

Honest trade-off: the run-lock assumes the holder narrator is reachable via PID. On the same machine that's reliable, but a future shared-filesystem deployment would need a heartbeat file with a TTL on top of the mkdir lock.
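For a sense of what that heartbeat check could look like, here is a speculative sketch. Nothing in it exists in Orchestra today: `isHeartbeatFresh`, the `heartbeat` file name, and the 30-second TTL are all assumptions, and it ignores clock skew between machines, which a real shared-filesystem deployment would have to account for.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical check: the lock counts as held only while the holder keeps
// touching its heartbeat file within the TTL. A missing file counts as stale.
function isHeartbeatFresh(heartbeatPath, ttlMs, now = Date.now()) {
  try {
    return now - fs.statSync(heartbeatPath).mtimeMs <= ttlMs;
  } catch (err) {
    if (err.code === 'ENOENT') return false;
    throw err;
  }
}

// Demo: evaluate the same heartbeat at two points in time.
const runDir = fs.mkdtempSync(path.join(os.tmpdir(), 'orchestra-heartbeat-'));
const heartbeatPath = path.join(runDir, 'heartbeat'); // file name is an assumption
fs.writeFileSync(heartbeatPath, String(process.pid));
const beatAt = fs.statSync(heartbeatPath).mtimeMs;
const TTL_MS = 30000; // assumed TTL

const freshSoonAfter = isHeartbeatFresh(heartbeatPath, TTL_MS, beatAt + 1000);
const freshAfterTtl = isHeartbeatFresh(heartbeatPath, TTL_MS, beatAt + TTL_MS + 1);
const freshWhenMissing = isHeartbeatFresh(path.join(runDir, 'absent'), TTL_MS);
console.log(freshSoonAfter, freshAfterTtl, freshWhenMissing); // true false false
fs.rmSync(runDir, { recursive: true, force: true });
```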

