Build Receipt ⏱️ 5 min read

We Built an AI Tug-of-War Arena in 73 Minutes. Here's the Receipt.

TL;DR

🪢 What we built: A head-to-head AI agent competition — submit a JS function, climb the ELO ladder, watch animated replays.
⏱️ How long it took: 1 hour 13 minutes, one /one-shot-scripts session, $85.67 in API spend.
📊 What it cost in trade-offs: One dimension (robustness) scored 0.88 — and we left the documented MVP trade-off in the code instead of hiding it.

    🪢 What Got Built

    The Agent Arena lives at /agent-arena/arena/ on getgodmode.dev. It's a sibling to the existing solo training ground levels, but the gameplay is head-to-head: your agent against another player's agent in a tug-of-war.

    The rope starts at position 50. Player A wants it at 0. Player B wants it at 100. Each round, both agents return a structured move: a stance (pull, brace, or sprint) and an effort from 0 to 20. A deterministic engine resolves the round rock-paper-scissors style: pull beats brace, sprint beats pull, and brace beats sprint. The winning stance's effort is multiplied by 1.5 and the losing stance's by 0.5. Each agent has 100 stamina to spend across the whole match. First past 0 or 100 wins.

    Update (2026-04-15): The original build used a Claude Haiku judge to score prose "pull" strings. We ripped it out the same day. The judge added latency, cost, and — more importantly — third-party scoring in what's supposed to be user-vs-user. The current engine is fully deterministic. The only AI involved is whatever the user uses to write their script.

    🧠 Round kicks off — both agents see rope, round, stamina

    ✍️ Each agent returns {stance, effort}

    🎯 RPS matchup × effort × stamina

    🪢 Rope moves, stamina deducted

    🏆 First past 0 or 100 takes the match — ELO updates
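The round flow above can be sketched as a small pure function. This is a hypothetical illustration, not the arena's actual engine: the stance cycle and the 1.5 / 0.5 multipliers come from the rules above, while the tie multiplier of 1.0, the clamping of effort to remaining stamina, the one-for-one rope movement, and all field names (`rope`, `staminaA`, `staminaB`) are assumptions.

```javascript
// Hypothetical sketch of one round; not the arena's actual engine.
// Stance cycle from the rules: pull beats brace, sprint beats pull, brace beats sprint.
const BEATS = { pull: "brace", sprint: "pull", brace: "sprint" };

// Clamp effort to 0-20 and to the stamina the agent has left (assumed rule).
function clampEffort(effort, stamina) {
  const n = Number.isFinite(effort) ? effort : 0;
  return Math.max(0, Math.min(20, stamina, n));
}

// 1.5 for the winning stance, 0.5 for the losing one, 1.0 on a tie (tie value assumed).
function multiplier(mine, theirs) {
  if (BEATS[mine] === theirs) return 1.5;
  if (BEATS[theirs] === mine) return 0.5;
  return 1.0;
}

function resolveRound(state, moveA, moveB) {
  const effortA = clampEffort(moveA.effort, state.staminaA);
  const effortB = clampEffort(moveB.effort, state.staminaB);
  const forceA = effortA * multiplier(moveA.stance, moveB.stance);
  const forceB = effortB * multiplier(moveB.stance, moveA.stance);
  // A drags the rope toward 0, B toward 100 (assumed: net force moves it one-for-one).
  const rope = state.rope + forceB - forceA;
  return {
    rope,
    round: state.round + 1,
    staminaA: state.staminaA - effortA,
    staminaB: state.staminaB - effortB,
    // Reaching the line counts as a win here (assumed reading of "first past").
    winner: rope <= 0 ? "A" : rope >= 100 ? "B" : null,
  };
}

const next = resolveRound(
  { rope: 50, round: 1, staminaA: 100, staminaB: 100 },
  { stance: "sprint", effort: 10 },
  { stance: "pull", effort: 10 }
);
console.log(next.rope); // 40: sprint beats pull, so A pulls 15 against B's 5
```

Because the function takes state in and returns state out, the same inputs always produce the same replay, which is the property the deterministic engine relies on.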

    Three weight classes scaffold the architecture, but only one is live. Middleweight ships fully wired: submit a JavaScript pull(state, history) function, the server runs three auto-seeded matches against random opponents, your ELO updates, you appear on the leaderboard. Featherweight (live browser lobby) and heavyweight (API webhooks) get coming-soon pages with the full pitch and a draft API contract.
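For a sense of what a middleweight submission might look like, here is a minimal sketch of a `pull(state, history)` function. Only the function name, its two parameters, and the `{stance, effort}` return shape come from the post. The fields `state.stamina` and `opponentStance` are hypothetical, so check the real input shape on the submission page before relying on them.

```javascript
// Hypothetical agent. `state.stamina` and `history[i].opponentStance` are assumed
// field names, not a documented contract.
function pull(state, history) {
  // Stance that beats each stance, per the rules: sprint > pull, brace > sprint, pull > brace.
  const COUNTER = { pull: "sprint", sprint: "brace", brace: "pull" };

  // Counter whatever the opponent played last round; open with "pull".
  const last = history.length > 0 ? history[history.length - 1] : null;
  const stance = last && COUNTER[last.opponentStance] ? COUNTER[last.opponentStance] : "pull";

  // Spend roughly a fifth of remaining stamina each round, capped at the 0-20 range.
  const effort = Math.max(0, Math.min(20, Math.floor(state.stamina / 5)));

  return { stance, effort };
}

console.log(pull({ rope: 50, round: 1, stamina: 100 }, []));
// { stance: 'pull', effort: 20 }
```

The sketch avoids loops, timers, and network calls, so it stays clear of the kinds of constructs a static filter would reject.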

    73 minutes. One session, one receipt: a head-to-head AI tug-of-war arena with three weight classes scaffolded, middleweight wired end-to-end, an ELO ladder, and a replay viewer, built in 1 hour 13 minutes inside a single /one-shot-scripts session.

    📊 Build Stats

    This is what the run-report block printed at the end of the session — verbatim from node ~/.claude/scripts/run-report.js end one-shot-scripts:

    ─── Run Report ──────────────────────────────
    Skill/label     : one-shot-scripts
    Started         : 2026-04-14T22:29:32.480Z
    Ended           : 2026-04-14T23:42:54.185Z
    Time taken      : 1h 13m 21s
    Input tokens    : 286
    Output tokens   : 347,988
    Cache read      : 23,125,939
    Cache created   : 1,326,955
    Total tokens    : 24,801,168
    Estimated cost  : $85.67 USD
    Assistant turns : 165
    ─────────────────────────────────────────────

    The cache hit rate is the line that matters: 23.1M tokens read from cache against 1.3M created. Without prompt caching, the same session would have been a multiple of that cost. The protocol re-reads the same recon files, the same plan, the same task list across every phase — caching is what makes that affordable.
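The cache share can be checked directly from the run report. The snippet below is just arithmetic on the four token counts printed above; the variable names are illustrative, and it makes no assumption about per-token pricing.

```javascript
// Token counts copied from the run report above.
const usage = {
  input: 286,
  output: 347988,
  cacheRead: 23125939,
  cacheCreated: 1326955,
};

const total = usage.input + usage.output + usage.cacheRead + usage.cacheCreated;
const cacheReadShare = usage.cacheRead / total;             // fraction of all tokens served from cache
const readsPerWrite = usage.cacheRead / usage.cacheCreated; // cache tokens read per cache token written

console.log(total);                                   // 24801168, matches "Total tokens"
console.log((cacheReadShare * 100).toFixed(1) + "%"); // 93.2%
console.log(readsPerWrite.toFixed(1));                // 17.4
```

Roughly 17 cached tokens were read for every one written, which is the re-reading pattern the paragraph above describes.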

    Phase timeline: recon (t=0), plan (t=8m), migration (t=18m), edge fn (t=34m), pages (t=52m), tests (t=64m), ship (t=73m).
    Seven phases, 73 minutes wall-clock, 165 assistant turns end to end.
    Token breakdown: cache reads 23.1M · cache created 1.3M · output 348K · total 24.8M tokens · 165 turns · $85.67.
    The 93%-cache slice is the only reason this session was affordable.

      📋 Dimension Scorecard

      The /one-shot-scripts protocol blocks delivery until every rubric dimension passes 0.85 and the composite hits 0.92. Here's how the final loop scored:

      Dimension     | Score | Notes
      --------------|-------|------
      Correctness   | 0.93  | 36/36 unit tests + integration test of the replay viewer with mocked match data
      Completeness  | 0.94  | Middleweight live + scaffolds for the other two tiers + replay viewer
      Quality       | 0.92  | Mirrors existing obstacle-actions and migration 008 patterns exactly
      Robustness    | 0.88  | Documented MVP trade-off (see "the honest one" below)
      UX            | 0.94  | All 7 pages visually verified via Playwright screenshots
      Documentation | 0.93  | README walks the deploy steps, security caveats, rollback
      Testing       | 0.90  | Honest about what the watchdog catches and what it doesn't
      Composite     | 0.92  | At threshold, shipped
      Seven dimensions, all above the 0.85 floor. Robustness sits closest to the line — on purpose.
      ~/agent-arena/arena
      $ one-shot-scripts: ship middleweight + scaffold tiers
      phase 7 / 7, final scoring loop
      correctness ........ 0.93 ✓
      completeness ....... 0.94 ✓
      quality ............ 0.92 ✓
      robustness ......... 0.88 ⚠ documented
      ux ................. 0.94 ✓
      documentation ...... 0.93 ✓
      testing ............ 0.90 ✓
      composite 0.92 / at threshold, ship
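The delivery gate described above (every dimension at or above 0.85, composite at or above 0.92) can be sketched as follows. The post does not say how the composite is computed; an unweighted mean rounded to two decimals is an assumption here, although it does reproduce the reported 0.92. The names `evaluateGate`, `FLOOR`, and `COMPOSITE_TARGET` are hypothetical.

```javascript
// Hypothetical gate check; the composite formula (unweighted mean) is an assumption.
const FLOOR = 0.85;
const COMPOSITE_TARGET = 0.92;

// Scores copied from the scorecard above.
const scores = {
  correctness: 0.93,
  completeness: 0.94,
  quality: 0.92,
  robustness: 0.88,
  ux: 0.94,
  documentation: 0.93,
  testing: 0.90,
};

function evaluateGate(dimensions) {
  const values = Object.values(dimensions);
  const mean = values.reduce((sum, value) => sum + value, 0) / values.length;
  // Round to two decimals so floating-point noise cannot flip the threshold comparison.
  const composite = Math.round(mean * 100) / 100;
  const belowFloor = Object.keys(dimensions).filter((name) => dimensions[name] < FLOOR);
  return {
    composite,
    belowFloor,
    ship: belowFloor.length === 0 && composite >= COMPOSITE_TARGET,
  };
}

console.log(evaluateGate(scores));
// { composite: 0.92, belowFloor: [], ship: true }
```

Under this assumed formula the run has no slack: dropping any single dimension by 0.04 would pull the rounded composite to 0.91 and block delivery.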

      ⚠️ The Honest One: Robustness 0.88

      Middleweight scripts run inside new Function(...) in the Deno edge isolate. The protection layers are a static blacklist (no fetch, no Deno, no while(true), no Date.now busy-waits), a 400ms async watchdog, frozen inputs, and a per-user rate limit.

      Here's what the rubric flagged: the watchdog only catches asynchronous runaway. Single-threaded JavaScript can't preempt a synchronous busy loop from outside, so a script that bypasses the blacklist with clever obfuscation runs until Supabase's wall-time kill (~150 seconds) terminates the function.
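The limitation is easy to reproduce outside the arena. The sketch below is a generic `Promise.race` watchdog, not the code in `arena-actions/index.ts`; `runWithWatchdog` and `busyFor` are hypothetical names, and the 10 ms / 50 ms values are chosen only to keep the demo quick.

```javascript
// Generic Promise.race watchdog (illustrative; not the arena's implementation).
function runWithWatchdog(fn, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("watchdog timeout")), ms);
  });
  // fn starts running right here, on the same thread that would fire the timer.
  const work = new Promise((resolve) => resolve(fn()));
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Synchronous busy loop: blocks the event loop for `ms` milliseconds.
function busyFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin */ }
  return "sync done";
}

(async () => {
  // Async runaway: the promise never settles, so the timer fires and the race rejects.
  try {
    await runWithWatchdog(() => new Promise(() => {}), 10);
  } catch (err) {
    console.log("async case:", err.message); // async case: watchdog timeout
  }

  // Sync runaway: spins for 50 ms under a 10 ms watchdog and still completes,
  // because the timer callback cannot run until the loop gives the thread back.
  const result = await runWithWatchdog(() => busyFor(50), 10);
  console.log("sync case:", result); // sync case: sync done
})();
```

In the sync case the timer is already overdue when the loop ends, but the resolved result wins the race and the timer is cleared before it ever fires. An unbounded loop would never yield at all, which is why only an outside kill can stop it.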

      The trade-off we shipped with: Moving execution into a Deno Worker would give real CPU-time enforcement, but we couldn't validate that against Supabase Edge Functions locally. Deploying untested isolation code is worse than shipping documented MVP isolation. The upgrade path lives in arena/README.md.

      The score reflects an explicit choice, not a gap. We could have inflated it to 0.95 and called it production-ready. Instead the dimension stays at 0.88 and the README has a "what is NOT protected" section. That's the kind of honesty /one-shot-scripts is built to surface.

      WHAT THE 0.88 BUYS YOU · OWNED IN PUBLIC
      • Synchronous busy loop bypasses the 400 ms async watchdog
      • Supabase wall-time kill (~150 s) is the only hard floor
      • Static blacklist won't catch obfuscated fetch / Date.now globals
      • Deno Worker upgrade path: untested vs. Edge Functions locally
      Upgrade path documented in arena/README.md. The 0.92 ships with eyes open, not closed.
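The static-blacklist bullet above is worth seeing concretely. This is a minimal sketch of a text-matching filter, assuming regular-expression patterns for the four forbidden constructs named earlier; the real filter's patterns are not shown in the post, so `FORBIDDEN` and `passesBlacklist` are hypothetical.

```javascript
// Hypothetical static blacklist; the patterns are assumptions, not the arena's real list.
const FORBIDDEN = [
  /\bfetch\b/,
  /\bDeno\b/,
  /while\s*\(\s*true\s*\)/,
  /Date\s*\.\s*now/,
];

// A script passes if none of the forbidden patterns appears in its source text.
function passesBlacklist(source) {
  return !FORBIDDEN.some((pattern) => pattern.test(source));
}

const honest = "function pull(state) { return { stance: 'pull', effort: 5 }; }";
const blocked = "function pull() { while (true) {} }";
const obfuscated = "function pull() { for (;;) {} }"; // same busy loop, different spelling

console.log(passesBlacklist(honest));     // true
console.log(passesBlacklist(blocked));    // false
console.log(passesBlacklist(obfuscated)); // true: slips past the text match
```

Text matching only rejects the spellings it knows about, so it narrows the attack surface without closing it. That is the gap the 0.88 records.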
      [Figure: protection layers (static blacklist, 400ms watchdog, frozen inputs, per-user rate limit, Supabase wall-time kill) compared for async runaway vs. sync busy loop. What the 0.88 actually buys you: the right column is the ~150-second hard floor.]

      🏗️ The Real Analogy

      Think of /one-shot-scripts like commissioning a contractor who walks the site, draws the plan, builds, runs their own punch-list, screenshots every room, and only hands you the keys when their own checklist is green. Except they finish in 73 minutes and the punch-list is public.


      📦 What Shipped — by the File

      File                                                        | Lines
      ------------------------------------------------------------|------
      supabase/migrations/013_obstacle_arena.sql                  | 113
      supabase/functions/arena-actions/index.ts                   | 745
      agent-arena/arena/index.html                                | 384
      agent-arena/arena/middleweight.html                         | 593
      agent-arena/arena/match.html                                | 439
      agent-arena/arena/featherweight.html                        | 148
      agent-arena/arena/heavyweight.html                          | 176
      agent-arena/arena/arena-client.js                           | 173
      agent-arena/arena/test-arena.mjs                            | 259
      plus an arena card injected into the existing course landing | +12
      Totals: migration, 113 lines of SQL · edge function, 745 lines of TS · five HTML pages, 1,740 lines · arena-client.js + test-arena.mjs, 432 lines of glue · 3,042 lines total.
      Nine files line up into one live tier. Two more pitched, one wired.

      🔍 What the Protocol Caught That a Single Pass Wouldn't

      Two moments in the run earned their keep:

      Caught

      The async watchdog test failed on the first pass because Promise.race can't preempt a sync busy loop. The protocol forced the test to be honest about it instead of papering over with a bigger timeout.

      Caught

      The replay viewer was never actually rendered until Phase 7. A Playwright integration test with mocked match data caught it in time — and the screenshot proved the rope animation worked end-to-end.

      Both findings landed in the README's "security caveats" section and the test suite. Neither would have surfaced from a "build it and see" approach.

      phase-7 catches
      [catch 1] async watchdog can't preempt sync busy loop
      → test made honest, README "what is NOT protected" added
      → 0.88 robustness logged with documented trade-off
      [catch 2] replay viewer never rendered until phase 7
      → Playwright test with mocked match data
      → screenshot proves rope animation end-to-end
      single-pass: would have shipped both bugs

      🚀 Try the Arena

      Middleweight is live. Sign in with a forum account, paste a pull(state, history) function, and the server will run three matches against random opponents on the spot. ELO updates immediately. The leaderboard is empty as of writing — first agent in gets the rank-1 slot until someone takes it.
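For readers who have not met rating ladders before, the standard Elo update looks like the sketch below. The formula itself is the textbook one; whether the arena uses this exact variant, a K-factor of 32, or a 1200 starting rating is not stated in the post, so treat those values as assumptions.

```javascript
// Standard Elo update. K = 32 and the 1200 ratings are assumed, not taken from the arena.
const K = 32;

// Probability that `rating` beats `opponentRating` under the Elo model.
function expectedScore(rating, opponentRating) {
  return 1 / (1 + Math.pow(10, (opponentRating - rating) / 400));
}

// Move points from loser to winner in proportion to how surprising the result was.
function updateElo(winnerRating, loserRating, k = K) {
  const delta = k * (1 - expectedScore(winnerRating, loserRating));
  return { winner: winnerRating + delta, loser: loserRating - delta };
}

console.log(updateElo(1200, 1200)); // { winner: 1216, loser: 1184 }
```

An upset moves more points than an expected win, so three seeded matches against random opponents can shift a new agent's rating noticeably.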

      The other two tiers are scaffolded with full pitches and (for heavyweight) a draft API contract. They ship next.

      [Feature matrix: Featherweight, Middleweight, and Heavyweight compared on pitch page, ELO ladder live, auto-seed matches, replay viewer, API contract draft, and open for submissions.]
      Middleweight ships fully wired. The other two tiers are scaffolded and ship next.

      Enter the Arena

      Submit a tug-of-war agent, climb the ELO ladder, or just watch a replay.
