Post-Mortem April 25, 2026 ⏱️ 5 min read

The 0.99 Bar Was Unreachable

TL;DR

🎯 The job: Inject animated CSS visuals into 40 static-only blog posts. Orchestra v2 ran the full 5-phase route on the first pass and shipped working HTML to disk
📉 The catch: Composite came in at 0.50; the ship bar was 0.99. Loop-Judge ruled "loop" three times in a row, and two of those loops were procedural no-ops chasing a gap the worker couldn’t close
🔧 The fix: Lower the ship threshold from 0.99 to 0.90. A 0.99 bar on a subjective visual task is a permission slip to loop forever

Figure: Seven Dim Scores · Asymptote & Ceiling. The ceiling drops from 0.99 to 0.90 at the end.

🏗️ The Job Was Done On Pass One

The task was simple to state: every blog post on this site should have an animated CSS visual that matches what the post is about. 40 of them were missing one. The other 7 already had something.

Orchestra v2 ran its full 5-phase route — Diagnose, Recon, Builder, Polish, Verifier — in one go. By the time scoring kicked in, the work on disk was correct: 40 posts modified, 7 untouched, every modified post carrying both an animated SVG figure AND a prefers-reduced-motion override for accessibility.

Core insight: The 5 phases are the route. Loops aren’t extra phases — they’re re-runs of the last two (Verifier and Polish) chasing a higher score on work that already exists.

The 5-phase route, ordered. The penalty entered at step 5.

>> BEFORE pass zero

+-- 7 of 47 posts had any visual
+-- 40 posts static, no figures
+-- zero accessibility coverage
+-- design language inconsistent

>> AFTER pass one

[*] 47 of 47 posts have visuals
[*] 40 posts modified in one go
[*] every modified post has prefers-reduced-motion
[*] +7,071 / -40 lines on disk

The job that triggered three judge loops was already shipped to disk.

🔁 Three Loops, Two No-Ops, One Real Lift

The composite came in at 0.50. The ship bar was 0.99. That’s nowhere near, so the runner woke up the Loop-Judge to decide whether to ship anyway or send work back through.

Figure: Three Loops · Two No-Ops · One Real Lift. Composite starts at 0.50; only loop 3 (Polish) actually moves the needle.

Three loops, +0.12 total. Still half a foot from the ceiling.

Loop 1 — Verifier (no-op)

Judge picked Verifier and asked it to broaden verification coverage from 9 of 40 posts to all 40. Verifier re-ran, re-stated the original 9 archetypes, and wrote a byte-identical report.

Loop 2 — Verifier again (no-op, predicted)

Judge looked at the round-1 result, noticed it was a procedural no-op, and warned in its own verdict: “if attempt 3 also returns identical artifacts, that’s strong signal we’ve hit a protocol diminishing-returns wall.” Then it picked Verifier a second time anyway, hoping the sharper instructions would land. They didn’t. verify.md mtime never updated.

Loop 3 — Polish (real lift)

The third judge changed the target. Instead of Verifier, it pointed Polish at the weakest dim — ambition at 0.78 — and Polish actually delivered: terminal animation hard-cap dropped, dead code stripped, eight near-identical CSS rules collapsed into one selector, pyflakes silent.
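Loops 1 and 2 could have been flagged mechanically, since both left the report byte-identical. The sketch below shows one way to detect that, assuming the runner can list the artifact files a loop is expected to change. The helper names (`fingerprint`, `snapshot`, `is_noop`) are hypothetical and not part of Orchestra v2.

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path) -> str:
    """SHA-256 of the file contents, or "" if the file does not exist."""
    if not path.exists():
        return ""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(artifacts: list[Path]) -> dict[str, str]:
    """Record a content hash for each artifact before a loop runs."""
    return {str(p): fingerprint(p) for p in artifacts}


def is_noop(before: dict[str, str], artifacts: list[Path]) -> bool:
    """True when every artifact hashes the same as it did in `before`."""
    return all(fingerprint(p) == before.get(str(p), "") for p in artifacts)


# Usage (paths are illustrative):
#   before = snapshot([Path("verify.md")])
#   ... run the loop ...
#   if is_noop(before, [Path("verify.md")]):
#       ... pick a different phase instead of repeating this one ...
```

Hashing contents rather than checking mtime also catches the case where a worker rewrites a file with identical bytes.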

Real-world analogy: Olympic gymnastics judging where the scale tops out at 10.0 but every routine has a 0.3 deduction baked in for breathing. You can do a flawless routine and still score 9.7. The bar isn’t reachable; the judges are just calibrated to never give it.

Where the loops actually went.

📊 What “Composite 0.50” Actually Means

Figure: The Banded Ship Policy. Amber band: the judge decides, at a cost of one judge call. The old 0.99 bar is shown for comparison.

The composite isn’t the average of the dim scores. It’s a weighted score with a first-loop cap at 0.80 and additional penalties for unverified work.

So the 0.50 number that triggered every loop wasn’t the dim scores being bad — those ranged 0.78 to 0.93. It was the rubric’s structural protections firing because verification only sampled 9 of 40 posts. Every loop would have to either close that gap (Verifier’s job) or move the dim scores up so the composite punched through the cap (Polish’s job).

Layer	Number	Meaning
Raw composite	0.50	Weighted score with first-loop cap and unverified-work penalty
Cited dim range	0.78 – 0.93	What individual scorers actually awarded each dimension
Ship threshold (was)	0.99	Composite must clear this to auto-ship without judge involvement
Ship threshold (now)	0.90	Lowered after this run — reachable by good work, still rejects bad

>> HOW EACH NUMBER GETS COMPUTED

Input	Dim avg	Cap	Penalty
first-loop run	0.873	capped at 0.80	-0.30 unverified
post-loop run	0.873	cap lifts off	still -0.30 unverified
verified run	0.873	no cap	none

Same dim scores, three composites. The penalty is the gap.

>> orchestra-runner.log composite resolution

> dim_scorer fired
+-- ambition 0.78
+-- polish 0.87
+-- docs 0.87
+-- testing 0.88
+-- visual 0.90
+-- correctness 0.90
+-- completeness 0.91
> composite_resolver fired
o-o weighted_avg : 0.873
+-- first_loop_cap : 0.80
+-- unverified_penalty : 9 of 40 sampled -0.30
+-- composite : 0.50
> ship_threshold 0.99 > composite 0.50
+-- decision: LOOP

The dim scores were fine. The cap and the penalty did the cutting.
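The arithmetic behind the log can be reproduced in a few lines. This is a minimal sketch, not the real resolver: the function name and the order of operations (cap first, then penalty) are assumptions, and so are the equal weights. The actual weights aren't published, but equal weights happen to reproduce the logged weighted_avg of 0.873.

```python
# Dim scores, cap, and penalty as printed in the runner log.
DIM_SCORES = {
    "ambition": 0.78, "polish": 0.87, "docs": 0.87, "testing": 0.88,
    "visual": 0.90, "correctness": 0.90, "completeness": 0.91,
}
FIRST_LOOP_CAP = 0.80
UNVERIFIED_PENALTY = 0.30


def composite(dims, weights=None, first_loop=True, verified=False):
    """Hypothetical resolver: weighted average, then cap, then penalty."""
    # ASSUMPTION: equal weights when none are supplied.
    if weights is None:
        weights = {name: 1.0 for name in dims}
    total = sum(weights.values())
    score = sum(dims[name] * weights[name] for name in dims) / total
    if first_loop:
        # The first-loop cap clips the score before any penalty applies.
        score = min(score, FIRST_LOOP_CAP)
    if not verified:
        # Flat deduction while verification coverage is incomplete.
        score -= UNVERIFIED_PENALTY
    return round(score, 3)


if __name__ == "__main__":
    print(composite(DIM_SCORES))                                   # first-loop run
    print(composite(DIM_SCORES, first_loop=False))                 # post-loop run
    print(composite(DIM_SCORES, first_loop=False, verified=True))  # verified run
```

Under these assumptions the first-loop run comes out at 0.5, matching the log. The post-loop and verified figures the sketch prints (0.573 and 0.873) are its own output, not logged values. Either way, no amount of Polish work alone clears a 0.90 bar while the 0.30 penalty is still attached.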

💡 Why 0.99 Doesn’t Work For Subjective Tasks

A 0.99 bar makes sense for objective tasks where every defect is a fact. Did the migration apply? Did the test pass? Did the binary compile? Yes-or-no, no-defects-allowed, ship-or-don’t.

It doesn’t work for visual tasks where every scorer is allowed to invent objections. “The terminal archetype animates in stills but the keyframe sweep could be smoother.” “Library was lifted not designed.” “Single visual language across all 40 posts.” All true. None of them are bugs.

Use 0.99 when

The task has a binary pass/fail at the end. Tests pass or they don’t. Migration ships or it doesn’t. The scorers can only ding you for facts.

Don’t use 0.99 when

The task is subjective. Visual quality, ambition, polish on a generated artifact — the scorers will always find something. The loop will run forever.

Objective tasks have a floor. Subjective tasks don't.

🐛 The Bug We Patched Mid-Run

Figure: Mid-Run Patch · Step 4 ↔ Step 5. BEFORE: the signoff never lands. AFTER: the two steps are swapped.

While the loops were running, we noticed chat.md was missing most of the worker signoffs. Workers were being killed by the reaper before they finished writing their signoff posts.

The cause was step ordering inside the spawn template. Workers were instructed to write result.json first, then their chat signoff. But writing result.json is what triggers the reaper to kill them — so the chat post never landed.

>> chat.md worker-signoff audit

$ grep -c "WORKER:" .orchestra/chat.md
+-- 7 expected 40
$ ls -la .orchestra/runs/*/result.json | wc -l
+-- 40 on disk, all complete
$ cat .orchestra/runs/builder-12/result.json | tail -3
"exit": "ok",
"wrote": ["claude-code-skips-tests.html"],
"signoff_attempted": true
$ # so the worker tried. the reaper got there first.

The signoffs existed in intent. The reaper killed the write.

BEFORE
↓
Step 4: Write result.json (triggers reaper)
↓
Step 5: Post signoff to chat.md (never executes)

AFTER
↓
Step 4: Post signoff to chat.md (lands first)
↓
Step 5: Write result.json last (reaper fires AFTER signoff is on disk)

Patched the template, bumped the skill version, and re-spawned the next worker with the fix. Earlier workers’ signoffs got reconstructed from their result.json files via a small backfill script.
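The reordered steps can be sketched as a single worker-side function. This is an illustration only: the real fix was an edit to the spawn template's instructions, and `finish_worker` and the signoff line format are hypothetical. Only the `WORKER:` prefix and the `result.json` filename come from the run.

```python
import json
import os
from pathlib import Path


def finish_worker(run_dir: Path, chat_path: Path, worker: str, result: dict) -> None:
    """Post the signoff first, then write result.json as the very last step."""
    # Step 4: append the signoff and force it to disk before anything else.
    with chat_path.open("a", encoding="utf-8") as chat:
        chat.write(f"WORKER: {worker} signoff, exit={result.get('exit')}\n")
        chat.flush()
        os.fsync(chat.fileno())

    # Step 5: write result.json last. Writing to a temp file and renaming it
    # means a watcher never sees a half-written result.json.
    tmp = run_dir / "result.json.tmp"
    tmp.write_text(json.dumps(result, indent=2), encoding="utf-8")
    os.replace(tmp, run_dir / "result.json")
```

The design point is that the file which triggers the reaper is the last thing the worker touches, so everything the worker owes the run is already on disk when the kill arrives.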

Mid-run patch, no restart, no rerun

The chat.md gap was caught by reading worker output, not by a test. The fix landed by editing the spawn template, bumping the skill version, and letting the next worker spawn pick up the new template. Earlier workers' signoffs were reconstructed offline from their result.json files. Zero rework on the in-flight run.
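A backfill along those lines might look like the sketch below. It is not the script used in the run: the signoff line format is assumed, and the only details taken from the audit are the `runs/*/result.json` layout, the `WORKER:` prefix, and the `exit` and `wrote` fields.

```python
import json
from pathlib import Path


def backfill_signoffs(runs_dir: Path, chat_path: Path) -> list[str]:
    """Append a signoff line for every run that has a result.json but no signoff."""
    existing = chat_path.read_text(encoding="utf-8") if chat_path.exists() else ""
    added = []
    for result_file in sorted(runs_dir.glob("*/result.json")):
        worker = result_file.parent.name
        # Trailing space so "builder-1" does not match "builder-12".
        marker = f"WORKER: {worker} "
        if marker in existing:
            continue
        result = json.loads(result_file.read_text(encoding="utf-8"))
        wrote = ", ".join(result.get("wrote", []))
        added.append(
            f"{marker}signoff (backfilled), exit={result.get('exit')}, wrote={wrote}"
        )
    if added:
        with chat_path.open("a", encoding="utf-8") as chat:
            # Start on a fresh line if the log does not end with a newline.
            if existing and not existing.endswith("\n"):
                chat.write("\n")
            chat.write("\n".join(added) + "\n")
    return added
```

Because it skips workers that already have a signoff, the sketch is safe to run more than once.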

📦 What Actually Shipped

Metric	Value
Tool	one-shot-orchestra-v2 (v0.5.1)
Phases run	Diagnose, Recon, Builder, Polish, Verifier (full route, then 3 loops)
Judge loops	3 (Verifier × 2, Polish × 1)
Composite (final)	0.50 raw · cited dims 0.78–0.93
Files modified	40 blog posts, +7,071 / -40 lines
Files skipped (correctly)	7 already-rich posts + index + 15 art gallery pages
Accessibility coverage	100% — every modified post has `prefers-reduced-motion`
Verdict	User-initiated ship after 3 loops + threshold lowered to 0.90

Two outputs from one run: code on disk + a config change.

🎯 The Lesson For Anyone Else Wiring A Scoring Loop

The loop is only as useful as the threshold makes it. Set the bar where good work clears and bad work doesn’t — not where perfect work clears, because perfect work doesn’t exist on a subjective task.

Auto-ship at 0.90. Human-judge at 0.50–0.89. Auto-loop below 0.50. The middle band is where the judge does its real job: deciding whether the gaps are worth closing or whether the work is genuinely shipped.

The 0.99 bar didn’t make the work better. It just made the run longer.
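The three bands reduce to a small routing function. A minimal sketch, assuming the runner exposes the composite as a float; the function and constant names are hypothetical. Scores between 0.89 and 0.90 are treated as judge territory here, which the band description leaves unspecified.

```python
SHIP_AT = 0.90   # auto-ship at or above this composite
JUDGE_AT = 0.50  # at or above this (and below SHIP_AT) the judge decides


def route(composite: float) -> str:
    """Map a composite score to one of the three bands."""
    if composite >= SHIP_AT:
        return "ship"
    if composite >= JUDGE_AT:
        return "judge"
    return "loop"


if __name__ == "__main__":
    for score in (0.93, 0.50, 0.42):
        print(score, route(score))
```

With the old bar, `SHIP_AT = 0.99` put every score this run produced in the judge band or below it, which is why the judge was called three times.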

The threshold is the lesson

Auto-ship at 0.90. Human-judge at 0.50 to 0.89. Auto-loop below 0.50. The middle band is where the judge has a real job. The top band is where good work clears. Pull the line down to where good work actually lives.

Three bands. Each does one job. The judge stops being the bottleneck.

