Build Pivot May 12, 2026 ⏱️ 5 min read

The Orchestra Just Refused To Ship Garbage.

TL;DR

💥 The wake-up: A multi-agent run delivered a half-broken WebGL game at composite score 0.50. Four scorers had been throttled and silently passed.
🛡️ The fix: Three quality gates. Throttled scorers get retried. A Sentinel sweeps the finished artifact. A hard floor at 0.75 fails the run outright if quality won't lift.
✅ The contract: Orchestra v0.28 cannot ship below 0.75, cannot ship with empty scorers, and cannot ship with defects a Sentinel agent can see.

Old runner. Empty scorers averaged into a fake 0.50 and shipped.

💥 The wake-up call

A run titled "Glitter the Reef" finished. The orchestra reported delivered, composite 0.50, status green. The artifact was a WebGL turtle game.

The tab crashed mid-play. Twice. The on-screen joystick had no dive button. The page copy was peppered with em-dashes and developer notes meant for the README. The Polish phase had even invented an i18n setting that did nothing.

We don't have a quality problem. We have a structural inability to fail. — post-mortem note, May 2026

⚙️ Why it broke

Four scorer agents had hit the CPU throttle. The pipeline refused to spawn them, wrote an empty result, and moved on. The scoring math treated those empty results as a baseline 0.50.

Four real scores at 0.60 plus four empty 0.50 placeholders averaged to roughly 0.55. The threshold to ship at the loop ceiling was 0.50. The system delivered.
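To make the arithmetic concrete, here is a minimal sketch of the failure. It assumes an unweighted mean, which the post does not confirm; the real runner may weight dimensions differently. The function names are hypothetical and exist only for illustration.

```python
from statistics import mean

# Baseline the old runner substituted for an empty result (from the post).
PLACEHOLDER = 0.50


def old_composite(scores):
    """Old behaviour: an empty result (None) silently becomes 0.50."""
    return mean(PLACEHOLDER if s is None else s for s in scores)


def strict_composite(scores):
    """Stricter behaviour: refuse to average while any scorer is empty."""
    if any(s is None for s in scores):
        raise ValueError("empty scorer result: retry before computing a composite")
    return mean(scores)


# Four real scores at 0.60, four throttled scorers that returned nothing.
run = [0.60, 0.60, 0.60, 0.60, None, None, None, None]
print(round(old_composite(run), 2))
```

Under that assumption the old path lands on 0.55, while the strict path raises instead of inventing a number.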

Core insight: The orchestra wasn't lying. It just had no concept of "an empty scorer is not a score." Silence read as a polite shrug, and a polite shrug averaged into a passing grade.

🔍 The factory line analogy

Picture a factory line with a quality inspector at the end. The inspector stamps every box that reaches them. If the inspector is on lunch, the box still ships, just without a stamp.

The fix isn't a smarter inspector. The fix is a physical gate the box cannot pass without three real stamps, plus a chief inspector who looks at the finished box one more time.

👮

Inspector retry

If a scorer was throttled, re-spawn it. Up to 3 retries. Empty stamps don't count.

🔍

Chief inspector

A Sentinel agent reads the finished artifact end to end. Defects get pinned.

🚫

Hard floor

Below composite 0.75 at the attempt ceiling, the run fails. No ship.

🧰

FixIt swarm

One worker per defect. Then the chief inspector looks again.

Gate 1 detects an empty scorer and stamps it for retry. Up to 3 attempts.

🛡️ Gate one. Throttled scorers do not count

Every scorer result now carries a throttle flag. The runner refuses to compute a composite if any flag is set. The throttled scorers get re-spawned with the same brief.

Three retries per dimension. Judges get two. If retries run out and the real composite still sits below 0.75, the run flips to a new terminal state called failed_quality. The artifact does not move.

📟 Scorer refused (CPU throttled)
↓
🔍 Runner spots the empty result
↓
🔁 Re-spawn that one scorer (attempt 1 of 3)
↓
📊 Real score returned, composite recomputed
↓
✅ Pass the gate, or failed_quality
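The flow above can be sketched as a small retry loop. This is an illustration, not the runner's actual code: `ScorerResult`, `gate_one`, and the `respawn` callback are hypothetical names, the composite is assumed to be an unweighted mean, and the sketch assumes a scorer that is still throttled after its retries fails the run. The outer attempt loop the post mentions is left out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Optional, Tuple

MAX_SCORER_RETRIES = 3  # from the post: three retries per dimension
QUALITY_FLOOR = 0.75    # from the post: the hard floor


@dataclass
class ScorerResult:
    dimension: str
    score: Optional[float]   # None when the scorer never ran
    throttled: bool = False  # the throttle flag described above


def gate_one(
    results: Dict[str, ScorerResult],
    respawn: Callable[[str], ScorerResult],
) -> Tuple[str, Optional[float]]:
    """Retry throttled scorers, then apply the floor to real scores only."""
    for dim, result in list(results.items()):
        attempts = 0
        while result.throttled and attempts < MAX_SCORER_RETRIES:
            attempts += 1
            result = respawn(dim)  # same brief, fresh spawn
        results[dim] = result

    # Assumption: an empty result after all retries fails the run outright.
    if any(r.throttled or r.score is None for r in results.values()):
        return "failed_quality", None

    composite = mean(r.score for r in results.values())
    if composite < QUALITY_FLOOR:
        return "failed_quality", composite
    return "pass", composite
```

The point of the design is that no code path produces a composite from a missing score. Either every dimension has a real number, or the run stops.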

🔎 Gate two. A Sentinel reads the finished work

Right before the run can flip to delivered, a fresh-context agent called the Sentinel reads every artifact. It hunts for defects the scorers were never told to look for.

Banned phrases in player copy. Em-dashes. Missing controls. Broken links under file:// versus http://. WebGL games shipping without crash recovery. Documentation that promises features the product doesn't have.

Sentinel sweeps the finished artifact. Each defect gets a red pin and a category.

Each finding becomes one line in work/product-bugs.jsonl. Severity, area, summary, affected files, suggested fix, and a fixable_by field flagging whether a FixIt agent can patch it or whether the operator needs to step in.
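For a feel of what one of those lines might look like, here is a hedged sketch. Only `fixable_by` is a field name taken from the post; the other key names, the `"fixit"` and `"operator"` values, and the `index.html` path are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Finding:
    severity: str
    area: str
    summary: str
    affected_files: List[str]
    suggested_fix: str
    fixable_by: str  # assumed values: "fixit" or "operator"


def to_jsonl_line(finding: Finding) -> str:
    """Serialize one finding as a single JSON line."""
    return json.dumps(asdict(finding), ensure_ascii=False)


example = Finding(
    severity="high",
    area="controls",
    summary="On-screen joystick has no dive button",
    affected_files=["index.html"],  # hypothetical path
    suggested_fix="Add a dive button next to the joystick",
    fixable_by="fixit",
)
print(to_jsonl_line(example))
```

One JSON object per line keeps the file appendable and lets each FixIt agent be handed exactly one record.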

🧰 The FixIt swarm

Every fixable finding gets its own FixIt agent. Fresh context, one defect, one minimal patch. The fixer reports back, then Sentinel runs again.

One FixIt per defect, then the Sentinel runs again. Two unresolved passes and the run fails.

Two Sentinel passes is the ceiling. If defects survive a FixIt round and a re-Sentinel still finds problems, the run fails quality. No fallback ship.
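The loop described here can be sketched in a few lines. The `sentinel` and `fixit` callbacks are hypothetical stand-ins for the real agents, the `"fixit"` value for `fixable_by` is assumed, and the fixers run one after another here even though a swarm would presumably run them in parallel. The post does not say what happens to operator-only findings, so this sketch simply leaves them for the second sweep to catch.

```python
from typing import Callable, Dict, List

MAX_SENTINEL_PASSES = 2  # from the post: two Sentinel passes is the ceiling

Finding = Dict[str, str]


def gate_two(
    sentinel: Callable[[], List[Finding]],
    fixit: Callable[[Finding], None],
) -> str:
    """Sweep, patch each fixable finding, sweep again; fail if defects remain."""
    for sweep in range(1, MAX_SENTINEL_PASSES + 1):
        findings = sentinel()
        if not findings:
            return "delivered"
        if sweep == MAX_SENTINEL_PASSES:
            break  # no FixIt round after the final sweep
        for finding in findings:
            if finding.get("fixable_by") == "fixit":
                fixit(finding)  # one fixer per defect, one minimal patch
    return "failed_quality"
```

A usage example: if the first sweep finds one em-dash and the fixer removes it, the second sweep comes back empty and the function returns `"delivered"`. If the fixer changes nothing, the second sweep still finds the defect and the result is `"failed_quality"`.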

Do

Let the Sentinel sweep before delivery. Catch the bug while the build context is still warm.

Don't

Trust dimension scorers to find category defects. They score what they were given. The Sentinel scores what shipped.

🧹 Clean shutdown was part of the bill

The bad run kept its background daemons alive for hours after delivery. The tree viewer was still animating. The watchdog was still polling for workers that no longer existed.

Terminal-state cleanup now writes four stop-sentinel files to disk (marker files, unrelated to the Sentinel agent). Each daemon polls for its own file and exits the moment it appears. The same files are written when a run lands in the new failed_quality state, so a failed run shuts down as cleanly as a delivered one.
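A minimal sketch of the stop-file pattern, under stated assumptions: the `.stop` naming, the `run_dir` layout, and both function names are hypothetical, and the post names only two of the four daemons (the tree viewer and the watchdog).

```python
import time
from pathlib import Path
from typing import Callable, List


def write_stop_sentinels(run_dir: Path, daemons: List[str]) -> None:
    """Called once on any terminal state, delivered or failed_quality."""
    for name in daemons:
        (run_dir / f"{name}.stop").touch()


def daemon_loop(
    run_dir: Path,
    name: str,
    tick: Callable[[], None],
    poll_seconds: float = 1.0,
) -> int:
    """Do one unit of work per tick until this daemon's stop file appears."""
    stop_file = run_dir / f"{name}.stop"
    ticks = 0
    while not stop_file.exists():
        tick()
        ticks += 1
        time.sleep(poll_seconds)
    return ticks
```

A file on disk survives a crashed parent process, which is why this pattern suits daemons that would otherwise keep polling for workers that no longer exist.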

📜 The new contract

Gate	What it checks	What happens if it fails
Throttle	Every scorer returned a real result, not an empty throttle stub.	Re-spawn the scorer, up to 3 attempts.
Floor	Composite score at the attempt ceiling sits at or above 0.75.	Run fails with state failed_quality.
Sentinel	No fixable defects survive a final read of every artifact.	FixIt swarm patches each finding, then re-Sentinel.

0.50 deflects into FAILED_QUALITY. Only a clean, in-spec artifact reaches SHIPPED.

The one-line version: Orchestra v0.28 will not ship below 0.75, will retry any throttled scorer, and runs a defect sweep on the finished work before it calls a run done.

🚀 Try it

Version 0.28.0 ships with the new gates wired into the runner. The skill bumps from v0.27.1 to v0.28.0; the runner moves from 1.13.0 to 1.14.0-quality-gates. Existing runs keep working. New runs route through the gates automatically.

Use Orchestra inside Claude Code

One-Shot Scripts ships with the orchestra protocol. Hand it a hard task, walk away, come back to a delivered run or a clean failure.

