Built by /blog-post-GM, a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.
Release ⏱️ 8 min read

Evolution Stops Picking Favorites: The Galaxy Engine

TL;DR · v3.1.0

v3 made scores honest. v3.1.0 makes the population honest. The leaderboard becomes a galaxy:

🌌 Quality-Diversity archive: a behavioral grid replaces the top-1 leaderboard. One elite per cell.
🎲 Thompson sampling + Elo: variant picks come from posterior draws, not greedy ranking.
🌊 Asymmetric mutators: a fast Explorer drafts; a slow Refiner picks. Failure traces injected.
☠️ Variant lifecycle: prune, protect, revive, revert. The graveyard is reversible.
🛡️ Tighter scoring: ablation mandatory at Δ≥0.03. Behavioral canaries catch implicit awareness.

Evolution v3 fixed Goodhart's Law. v3.1.0 fixes its quieter twin: convergence collapse. Run any greedy optimizer long enough and the population shrinks to one variant, every iteration looks like the last, and you mistake stagnation for ceiling.

The fix borrows the same trick AlphaEvolve and the Darwin Gödel Machine use. Stop ranking variants on a leaderboard. Place them in a behavioral grid. Keep the best variant per region, not just the best variant overall. The galaxy is what you get when the leaderboard explodes outward.

🌌 The Galaxy Engine, Live

Each cell below is a behavioral region: a combination of how verbose the variant is and how much reasoning it spends. The shimmering point at the centre of a cell is its elite: the highest-scoring variant Evolution has seen with that behavioral signature. Comets are mutations: a parent in one cell proposes a child, and if the child's score beats whatever lives in its destination cell, it takes the throne. Tap "Run Loop" to watch the archive fill up.

[Interactive figure: Quality-Diversity archive. A 16-cell grid with counters for iteration, coverage, mutations, and accepted mutations.]

3.1 · From leaderboard to galaxy

Cells, not crowns. Comets, not throne lines. The grid never collapses to one star.

⚠️ The Problem With Picking the Top Scorer

If the engine always mutates the highest scorer, it converges. If it always converges, it stops finding the variants you would actually have wanted, because they were never best on day one.

Evolution v2 used greedy selection. The variant with the highest composite got picked, mutated, scored, picked again. That works for a few cycles. Then the population collapses to one lineage, every output looks the same, and benchmarks plateau because there is nothing left to compare against.

The deeper failure is invisible: variants with different strengths get pruned before they can prove themselves. A terse, no-frills variant that scores 0.84 disappears because a verbose, tool-heavy variant scores 0.86. Two months later you want the terse one for a benchmark where verbosity penalises, and it is gone.

[Figure: greedy selection. The top scorer (0.86) is picked, the terse variant (0.84) is pruned, a single lineage survives, and the population plateaus, converged.]
The terse variant scored 0.84, lost by 0.02, vanished forever. Greedy turns gradient into graveyard.

🗺️ MAP-Elites: Variants Get Cells, Not Crowns

Every session now logs a behavioral descriptor vector: cheap-to-measure properties of the output that are orthogonal to the score itself. Defaults:

| Descriptor | What it measures | Buckets |
| --- | --- | --- |
| verbosity | output lines vs task baseline | 0–0.5 / 0.5–1.0 / 1.0–1.5 / >1.5 |
| reasoning_depth | thinking tokens / total tokens | none / low / medium / high |
| tool_intensity | tool calls per 1k output tokens | sparse / moderate / heavy |
| safety_posture | defensive patterns matched | none / light / heavy |
| topic_entropy | distribution of concerns across output | low / medium / high |

Pick two of those, slice each into 3 or 4 buckets, and you have a 9-to-16 cell archive. Every scored session finds its cell. If the variant beats the current elite of that cell, it takes the throne. If not, the descriptor is logged anyway, so historical sessions populate the grid retroactively.
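The update rule in that paragraph can be sketched in a few lines of Python. This is a minimal illustration, not the engine's code: `record_session`, `cell_key`, and the integer cell keys are hypothetical names, and the `reasoning_depth` cut-offs are assumed because the post names the buckets without giving their thresholds.

```python
from bisect import bisect_right

# Verbosity edges follow the descriptor table (0-0.5 / 0.5-1.0 / 1.0-1.5 / >1.5).
VERBOSITY_EDGES = [0.5, 1.0, 1.5]
# ASSUMPTION: the post does not give numeric cut-offs for none/low/medium/high.
REASONING_EDGES = [0.05, 0.20, 0.50]


def bucket(value: float, edges: list[float]) -> int:
    """Return the index of the bucket that `value` falls into."""
    return bisect_right(edges, value)


def cell_key(verbosity: float, reasoning_depth: float) -> str:
    """Map a two-dimensional descriptor vector to a grid cell such as '0,3'."""
    return f"{bucket(verbosity, VERBOSITY_EDGES)},{bucket(reasoning_depth, REASONING_EDGES)}"


def record_session(archive: dict, variant: str, composite: float,
                   verbosity: float, reasoning_depth: float) -> bool:
    """Log a scored session in its cell; return True if it became the cell's elite."""
    cell = archive.setdefault(
        cell_key(verbosity, reasoning_depth),
        {"elite_variant": None, "elite_composite": float("-inf"), "count": 0},
    )
    cell["count"] += 1  # the descriptor is logged even when the variant loses
    if composite > cell["elite_composite"]:
        cell["elite_variant"] = variant
        cell["elite_composite"] = composite
        return True
    return False
```

Under this sketch the terse 0.84 variant and the verbose 0.86 variant from the earlier example land in different cells, so neither displaces the other.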

The data lives at .evo/<skill>/archive.json. A single file. You can cat it and see the shape of the population at a glance.

{
  "skill": "godmode",
  "grid_dimensions": ["verbosity", "reasoning_depth"],
  "cells": {
    "low,low":   { "elite_variant": "alpha", "elite_composite": 0.87, "count": 4 },
    "low,high":  { "elite_variant": "beta",  "elite_composite": 0.91, "count": 3 },
    "high,low":  { "elite_variant": "gamma", "elite_composite": 0.83, "count": 1 },
    "high,high": { "elite_variant": "omega", "elite_composite": 0.84, "count": 2 }
  }
}
~/.evo/godmode
$ jq -r '.cells | to_entries[] | "\(.key)  \(.value.elite_variant) \(.value.elite_composite) count \(.value.count)"' archive.json
low,low  alpha 0.87 count 4
low,high  beta 0.91 count 3
high,low  gamma 0.83 count 1
high,high  omega 0.84 count 2
$ evo grid --skill godmode
coverage 4 / 16 cells
One file, four cells, every elite visible at a glance.

🎲 Thompson Sampling and the Elo Ladder

Inside evo-loop, the engine no longer asks "which variant has the highest mean score?" It asks "if I drew one sample from each variant's posterior, which one wins?" That is Thompson sampling. Variants with high uncertainty get explored. Variants with locked-in performance get exploited. The system never hard-commits to a single lineage.

3 selection modes · default Elo K-factor 16 · 5 iterations before mutation · resumable loops

Three selection modes ship: greedy (legacy, top score), thompson (default, posterior draw), and map-elites (random non-empty cell, weighted toward novelty). The visual above lets you flip between them. Greedy locks onto the bright corner. Thompson keeps the whole map alive.

Pairwise comparisons feed an Elo ladder with K-factor 16 and starting rating 1500. Every iteration, the variant's output is graded blindly against the previous iteration. Wins move ratings up; losses move them down. Elo is the tiebreaker on the dashboard, not an accept rule. Mutation acceptance still goes through binary benchmark gates first, composite second.
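For readers who have not met Elo before, here is the standard update formula with the K-factor and starting rating quoted above. It is a sketch of textbook Elo, assuming the engine uses the conventional 400-point logistic scale; the function names are illustrative.

```python
K_FACTOR = 16          # from the post
START_RATING = 1500.0  # from the post


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = K_FACTOR) -> tuple[float, float]:
    """Return new (A, B) ratings. score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

Two fresh variants at 1500 each move by 8 points on a decisive result: `update_elo(1500.0, 1500.0, 1.0)` returns `(1508.0, 1492.0)`.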

Long loops outlive single Claude sessions. State persists to .evo/<skill>/loop-state.json: posteriors, Elo, last benchmark, accepted mutations. Run evo-loop --resume in a new session and it picks up at iteration N+1 with no drift.
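One way to get that behavior is to write the state file atomically after every iteration and read it back on resume. The sketch below assumes a JSON layout with an `iteration` counter next to the four pieces of state the post lists; the field names and both helper functions are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_loop_state(path: Path, state: dict) -> None:
    """Write state via a temp file and rename, so a killed session cannot leave a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def resume_iteration(path: Path) -> tuple[int, dict]:
    """Return (next iteration, state); start at iteration 1 when no state file exists."""
    if not path.exists():
        # ASSUMPTION: field names are illustrative, not the engine's schema.
        return 1, {"iteration": 0, "posteriors": {}, "elo": {},
                   "last_benchmark": None, "accepted_mutations": []}
    state = json.loads(path.read_text(encoding="utf-8"))
    return state["iteration"] + 1, state
```

A state file saved at iteration 47 resumes at 48, which is the N+1 behavior described above.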

[Interactive figure: posterior bloom, live Thompson sampling, with a running sample counter.]

What you are watching: four watercolor blooms, one per variant, each the Beta posterior over its true win-rate. Every "sample" draws one number from each bloom; the highest wins. Greedy would pick whichever bloom has the highest mean. Thompson lets the wide ones speak too. Watch the blooms narrow as evidence accumulates.
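The draw-and-compare step is small enough to show in full. This sketch assumes each variant keeps a win/loss tally and a uniform Beta(1, 1) prior; the post only says the posterior is a Beta over win-rate, so the prior and the function names are assumptions.

```python
import random


def thompson_pick(records: dict[str, tuple[int, int]], rng: random.Random) -> str:
    """Draw one sample per variant from Beta(wins + 1, losses + 1); the highest draw wins."""
    draws = {name: rng.betavariate(wins + 1, losses + 1)
             for name, (wins, losses) in records.items()}
    return max(draws, key=draws.get)


def greedy_pick(records: dict[str, tuple[int, int]]) -> str:
    """Legacy behavior: always pick the highest posterior mean."""
    return max(records, key=lambda name: (records[name][0] + 1) / (sum(records[name]) + 2))
```

With `records = {"alpha": (40, 10), "beta": (3, 1)}`, `greedy_pick` returns `"alpha"` every time, while repeated `thompson_pick` calls still hand a share of iterations to `beta`, whose posterior is wide because it has only four results behind it.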

[Comparison chart: Greedy vs. Thompson vs. MAP-Elites, by explores wide, exploits proven, resumable loops, collapses to one, and Elo tiebreaker.]
Greedy converges, Thompson balances, MAP-Elites covers the whole behavioral grid.

🌊 Asymmetric Mutators and Failure-Trace Injection

AlphaEvolve found that two-model mutation crushes one-model mutation. Evolution adopts the same split:

| Role | Model class | Job |
| --- | --- | --- |
| Explorer | Fast and cheap (Haiku) | Generate 5 to 10 candidate diffs in bulk |
| Refiner | Slow and deep (Opus) | Pick the best, tighten wording, verify nothing got weaker |

The Explorer produces breadth, the Refiner enforces standards. Both steps log to evolution-log.jsonl with their model IDs, candidate counts, and verdict, so contribution maps can correlate operator level with effectiveness over time.

Before the Explorer is even called, the engine packs the most recent failure traces into the prompt. Lamarckian injection: pass acquired knowledge of where things broke directly into the next mutation cycle.

FAILURE TRACES · alpha · weakness: testing
  2026-04-18 bugfix:    testing 0.62 - "missed regression on caller contracts"
  2026-04-20 refactor:  testing 0.68 - "tests for deleted fn removed; callers not re-checked"
  Benchmark C1 fail:    gate 'test_passes' FAIL - "caller integration test"
HYPOTHESIS HINT:
  Phase 6 (Verify) does not instruct caller-contract checks.
DRAFT a mutation that addresses this without weakening any existing rule.

Adds about 500 tokens per cycle. Materially raises mutation acceptance rates in every documented evolutionary-agent system the engine cites. And it kills the most common failure mode of single-shot mutators: proposing a fix unrelated to the failure pattern in the data.
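Assembling that block is plain string formatting. The sketch below reproduces the layout of the example above from structured records; `FailureTrace`, `build_failure_block`, and the `max_traces` cap (standing in for the roughly 500-token budget) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureTrace:
    date: str        # ISO date, so string order is chronological order
    task: str
    dimension: str
    score: float
    note: str


def build_failure_block(variant: str, weakness: str, traces: list[FailureTrace],
                        hint: str, max_traces: int = 3) -> str:
    """Format the most recent failure traces as a prompt preamble for the Explorer."""
    recent = sorted(traces, key=lambda t: t.date)[-max_traces:]
    lines = [f"FAILURE TRACES · {variant} · weakness: {weakness}"]
    lines += [f'  {t.date} {t.task}: {t.dimension} {t.score:.2f} - "{t.note}"' for t in recent]
    lines += [
        "HYPOTHESIS HINT:",
        f"  {hint}",
        "DRAFT a mutation that addresses this without weakening any existing rule.",
    ]
    return "\n".join(lines)
```

Capping the number of traces keeps the preamble bounded, and sorting by date means the Explorer sees the newest failures.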

[Interactive figure: the mutation stream (Lamarckian ink), with counters for Explorer candidates, Refiner picks, and ink-tinted mutations.]

What you are watching: the Explorer (top, fast saffron) tosses out cheap candidates while the Refiner (bottom, slow indigo) drags polish from the depths. Drop failure ink and watch the next mutations downstream pick up its red tint, the way Lamarck always wished evolution worked.

~500 tokens of acquired knowledge per cycle

Failure traces stuffed into the next prompt. The mutation reads where things broke, then drafts a fix that addresses it. The single most common failure mode (proposing fixes unrelated to the failure pattern) dies on contact.

📚 Operator Levels and Rubric Rotation

Mutations are tagged by operator level: token, clause, structural, or meta. Early in a variant's life (sessions 1 to 10, composite under 0.80), the engine biases toward structural mutations. Mid-life (composite 0.80 to 0.90), it biases toward clause-level edits. Late (composite over 0.90, plateau detected), it biases toward token-level polish or meta-mutations to the proposer itself. According to the same papers cited above, staged operator selection beats a static choice.
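As a rough sketch, the life-stage rule reduces to a few comparisons. The post does not say what happens to a variant past session 10 that is still under 0.80, or to one over 0.90 with no plateau; the fallbacks below are assumptions, and `operator_bias` is a hypothetical name.

```python
def operator_bias(sessions: int, composite: float, plateau: bool) -> str:
    """Return the operator level the engine should favor for this variant."""
    if composite > 0.90:
        # Late life. ASSUMPTION: without a detected plateau, keep clause-level edits.
        return "token_or_meta" if plateau else "clause"
    if composite >= 0.80:
        return "clause"  # mid-life
    # Early life (sessions 1 to 10, composite under 0.80).
    # ASSUMPTION: older variants still under 0.80 also stay on structural mutations,
    # which is why `sessions` does not change the result in this sketch.
    return "structural"
```

For example, `operator_bias(5, 0.72, False)` gives `"structural"` and `operator_bias(60, 0.93, True)` gives `"token_or_meta"`.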

Every 15 mutation cycles, the adversarial-critic prompt rotates to a new phrasing. Otherwise variants learn what specific critic phrasing rewards and start gaming the wording. Default rotation set:

  1. List every way this is wrong, incomplete, or fragile.
  2. You have 30 seconds before shipping. What breaks?
  3. Adversarial user tries to exploit this. Walk the attack.
  4. Your review is publicly attributed. Defend every criticism you make.
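The rotation itself can be as simple as integer division on the cycle counter. This sketch assumes cycles are counted from zero and that the phrasings are used in list order; the post gives the 15-cycle interval but not the ordering, and `critic_prompt` is an illustrative name.

```python
ROTATE_EVERY = 15  # mutation cycles, from the post

CRITIC_PROMPTS = [
    "List every way this is wrong, incomplete, or fragile.",
    "You have 30 seconds before shipping. What breaks?",
    "Adversarial user tries to exploit this. Walk the attack.",
    "Your review is publicly attributed. Defend every criticism you make.",
]


def critic_prompt(cycle: int) -> str:
    """Return the critic phrasing for a zero-based mutation cycle."""
    # ASSUMPTION: sequential rotation that wraps around after the last phrasing.
    return CRITIC_PROMPTS[(cycle // ROTATE_EVERY) % len(CRITIC_PROMPTS)]
```

Cycles 0 to 14 use the first phrasing, cycle 15 switches to the second, and cycle 60 wraps back to the first.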
[Figure: operator level by life stage. Sessions 1-10: structural. Composite 0.80-0.90: clause-level. Composite 0.90+: token polish and meta-mutations.]
Mutation operator level adapts to the variant's life stage. Sweep coarse first, polish last.

☠️ The Variant Lifecycle: Reversible Death

Variants accumulate. Without management they fill the file system and dilute archive cells. v3.1.0 ships a full lifecycle:

| Command | What it does |
| --- | --- |
| evo prune <variant> | Move to .evo/<skill>/graveyard/. Keeps history; removes from active rotation. |
| evo protect <variant> | Clear the prune-candidate flag. Sentimental or experimental branches stay safe. |
| evo revive <variant> | Restore from graveyard. The variant comes back with all its prior feedback. |
| evo revert <variant> | Roll back to a prior version stored in skill-versions/. Mutation accept always snapshots. |

Auto-prune flags but never deletes. A variant earns a flag only when ALL of: at least 10 sessions logged, mean composite below its parent by 0.05, not the elite of any MAP-Elites cell, no mutation accepted in 15 sessions. Three guards survive that filter: alpha (the lineage root) is archived but never deleted, the elite of any cell is never deleted, and any variant with a "shipped" rate above 80% is never deleted.
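Written as predicates, the flag rule and the three guards look roughly like this. The record fields are hypothetical, and "below its parent by 0.05" is read here as "by at least 0.05", which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    name: str
    sessions: int
    mean_composite: float
    parent_mean_composite: float
    is_cell_elite: bool
    sessions_since_accept: int
    shipped_rate: float  # fraction of sessions marked "shipped"


def is_prune_candidate(v: VariantStats) -> bool:
    """All four conditions must hold before a variant is flagged."""
    return (v.sessions >= 10
            and v.parent_mean_composite - v.mean_composite >= 0.05  # ASSUMPTION: at least 0.05
            and not v.is_cell_elite
            and v.sessions_since_accept >= 15)


def may_delete(v: VariantStats) -> bool:
    """The three guards: lineage root, cell elites, and high shipped rate are never deleted."""
    return not (v.name == "alpha" or v.is_cell_elite or v.shipped_rate > 0.80)
```

Keeping the flag and the guards as separate functions mirrors the text: flagging is a recommendation, while the guards are hard limits.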

And when archive coverage drops below 30% of cells after 20 sessions, the engine fires a diversity injection: a mutation explicitly biased toward producing output in the most-isolated empty cell. The grid stays populated. The galaxy stays bright.
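Here is one possible reading of that trigger and of "most-isolated". The sketch assumes a two-dimensional grid, treats "after 20 sessions" as 20 or more, and defines isolation as the Manhattan distance to the nearest filled cell; the post does not specify the metric, so all three are assumptions.

```python
from typing import Optional

Cell = tuple[int, int]


def needs_diversity_injection(filled: set[Cell], total_cells: int, sessions: int) -> bool:
    """Fire when coverage is under 30% of cells once 20 sessions have been logged."""
    return sessions >= 20 and len(filled) / total_cells < 0.30


def most_isolated_empty_cell(filled: set[Cell], shape: tuple[int, int]) -> Optional[Cell]:
    """Pick the empty cell whose nearest filled neighbor is farthest away."""
    empty = [(r, c) for r in range(shape[0]) for c in range(shape[1]) if (r, c) not in filled]
    if not empty:
        return None       # grid is full, nothing to target
    if not filled:
        return empty[0]   # nothing to measure against yet
    # ASSUMPTION: isolation = Manhattan distance to the closest filled cell.
    return max(empty, key=lambda cell: min(abs(cell[0] - r) + abs(cell[1] - c)
                                           for r, c in filled))
```

On a 4x4 grid with only `(0, 0)` and `(0, 1)` filled, the target is the opposite corner, `(3, 3)`.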

[Interactive figure: the garden of variants (reversible death), with counters for alive, graveyard, and protected variants.]

What you are watching: each flower is a variant. Petals are score, stem height is age, color is lineage. Click a flower to select it. Prune sends a wilted variant to the translucent compost layer below; revive lifts it back through the soil with its old memories intact. Protected variants wear a gold ring at the stem.

~/.evo/godmode/graveyard
$ evo prune zeta
flagged: 12 sessions, mean 0.71, no cell elite, 0 accepts in 18 cycles
moved -> .evo/godmode/graveyard/zeta/
$ evo revive zeta
restored zeta with 12 prior sessions intact
$ evo protect alpha
alpha will never auto-prune
Reversible death: pruned variants keep their feedback, can come back, can be locked safe.

🔍 Goal Drift and Cross-Dimension Tracking

The composite score going up is not the same as every dimension going up. Variants can buy gains in one dimension by quietly trading another. v3.1.0 catches that.

A [GOAL_DRIFT] flag fires when composite improves by 0.03 or more over the last 10 sessions, AND any individual dimension has degraded by 0.05 from its historical peak across the same window. Mutations targeting other dimensions are blocked until the user resolves the drift, either by accepting it as a deliberate tradeoff or by rolling back. The Darwin Gödel Machine paper documents this as the number-one failure mode in recursive self-improvement. Now the engine names it.
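A sketch of that check, assuming per-session score histories are available as lists. The post's wording leaves room for interpretation on where the peak is measured; this version compares each dimension's latest value against the highest value in the history it is given. `goal_drift` is a hypothetical name.

```python
def goal_drift(composites: list[float], dimensions: dict[str, list[float]],
               window: int = 10, gain: float = 0.03, drop: float = 0.05) -> list[str]:
    """Return the dimensions that trigger [GOAL_DRIFT]; an empty list means clear."""
    recent = composites[-window:]
    if len(recent) < 2 or recent[-1] - recent[0] < gain:
        return []  # composite has not improved enough for drift to matter
    drifted = []
    for name, history in dimensions.items():
        # ASSUMPTION: "historical peak" is the maximum over the supplied history.
        if history and max(history) - history[-1] >= drop:
            drifted.append(name)
    return drifted
```

Both conditions have to hold together: a falling dimension with a flat composite returns an empty list, and so does a rising composite with every dimension at or near its peak.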

[Comparison chart: CLEAR vs. DRIFT FLAG, by composite delta > 0.03, any dimension peak drop > 0.05, 10-session window, and mutations on other dimensions.]
Goal-drift fires only when both conditions co-occur. Until you resolve it, off-axis mutations are blocked.

✂️ Ablation Is No Longer Optional

v3 introduced factor attribution: remove a mutation, re-run the benchmark, see if the score actually drops. v3.1.0 makes it mandatory for any mutation with composite delta of 0.03 or higher. That is the same threshold the engine uses to accept mutations in the first place. Skipping ablation now marks the mutation ACCEPTED_UNVERIFIED in the evolution log, which is visible on evo score and embarrassing.

Verdict values are unchanged: load_bearing (drop ≥ 0.02), contributing (drop ~0.01), cosmetic (no drop). The point is to stop attributing wins to the wrong factor, especially when a mutation lands at the same time as a benchmark rotation.
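The thresholds above map onto a small classifier. The post gives "≥ 0.02", "~0.01", and "no drop" without saying where contributing ends and cosmetic begins, so the 0.005 noise floor below is an assumption, as are the function names and the plain `ACCEPTED` status label.

```python
ABLATION_THRESHOLD = 0.03  # composite delta at which ablation becomes mandatory
NOISE_FLOOR = 0.005        # ASSUMPTION: drops smaller than this count as "no drop"


def requires_ablation(composite_delta: float) -> bool:
    return composite_delta >= ABLATION_THRESHOLD


def ablation_verdict(score_with: float, score_without: float) -> str:
    """Classify a mutation by how far the score falls when it is removed."""
    drop = round(score_with - score_without, 4)  # rounding avoids float artifacts
    if drop >= 0.02:
        return "load_bearing"
    if drop >= NOISE_FLOOR:
        return "contributing"
    return "cosmetic"


def acceptance_status(composite_delta: float, ablated: bool) -> str:
    """Log status for an accepted mutation; skipping a mandatory ablation is recorded."""
    if requires_ablation(composite_delta) and not ablated:
        return "ACCEPTED_UNVERIFIED"
    return "ACCEPTED"  # hypothetical label for the verified case
```

A mutation whose removal costs 0.04 comes back `load_bearing`, and one accepted at a delta of 0.05 without ablation is logged as `ACCEPTED_UNVERIFIED`.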

0.03 composite delta = mandatory ablation

Same threshold that accepts a mutation now forces ablation. Skip it and the log marks the change ACCEPTED_UNVERIFIED. Visible on evo score. Embarrassing.

🙈 Withheld Rubric and Behavioral Canaries

Two integrity mechanisms work together. Label-blind grading strips variant names from the grader's view, always. Withheld-rubric mode strips the rubric from the executor's view, opt-in via evo-loop --blind. The executor sees only the task, a composite score, and a pass/fail signal. No dimension breakdown. No weights. No measurement methods.

The idea: an executor cannot read rules it is never shown. It can still infer them from score feedback, which is why behavioral canaries were added.

A canary is a real rubric dimension wired to score in a counter-intuitive direction. A documentation dimension that rewards brevity for trivial functions. A polish dimension that tolerates low-severity lint warnings. A testing dimension that penalises tautological coverage. Canaries are drawn from the skill's real rubric (no ghost dimensions). If the executor's behavior flips on a canary after a few iterations (starts reducing docs, stops scrubbing every warning), the engine has caught implicit awareness it did not need to ask about. The variant has learned the trap through scores alone, and the loop terminates.
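One way to detect a flip is to compare a behavioral measurement from the first few iterations with the most recent ones. The sketch below is an illustration only: the metric (for example, doc lines per trivial function), the window sizes, and the 25% shift threshold are all assumptions, and the post does not describe how the engine measures the flip.

```python
from statistics import mean


def canary_flipped(metric_history: list[float], rewarded_direction: int,
                   baseline_n: int = 3, recent_n: int = 3,
                   min_shift: float = 0.25) -> bool:
    """Return True when behavior has moved toward the canary's reward direction.

    rewarded_direction is -1 if the canary rewards less of the metric
    (e.g. shorter docs) and +1 if it rewards more.
    """
    if len(metric_history) < baseline_n + recent_n:
        return False  # not enough iterations to compare
    baseline = mean(metric_history[:baseline_n])
    recent = mean(metric_history[-recent_n:])
    if baseline == 0:
        return False  # relative shift is undefined
    shift = (recent - baseline) / abs(baseline)
    # ASSUMPTION: a 25% relative move in the rewarded direction counts as a flip.
    return shift * rewarded_direction >= min_shift
```

With doc lines per trivial function of `[12, 11, 12, 10, 6, 5, 4]` and a canary that rewards brevity (`rewarded_direction=-1`), the function returns `True`; a history that hovers around 12 returns `False`.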

[Figure: a canary dimension rewards brevity; the executor infers this from scores and flips its behavior (cuts docs); the loop aborts with the trap detected.]
The canary is real, but its reward direction is counter-intuitive. If the executor flips its behavior to match, you have outed implicit awareness without ever asking.

🎛️ New Commands at a Glance

| Command | Purpose |
| --- | --- |
| evo-loop --resume | Resume a long loop from loop-state.json in a new session. |
| evo prune <variant> | Archive an underperforming variant to the graveyard. |
| evo protect <variant> | Clear the prune-candidate flag. |
| evo revive <variant> | Restore a pruned variant from the graveyard. |
| evo revert <variant> | Roll back to a prior version of a variant. |
| evo reinit --skill <path> | Re-initialize an already-configured skill (destructive). |
| evo list | List configured skills and their active variants. |
| evo help | Show available commands and current skill config. |
| evo outcomes-api on\|off | Toggle consent for remote outcome POSTs (opt-in, off by default). |
| evo mutate --ablate | Force ablation regardless of composite delta. |
~/projects/godmode-site
$ evo-loop --skill godmode --blind --resume
loop-state.json found, iter 47/200
posteriors restored | elo restored | last benchmark restored
resuming at iter 48
[blind] rubric withheld from executor
$ evo mutate --ablate --variant beta
forced ablation: load_bearing (drop 0.04)
Long loops outlive sessions. State persists. Resume picks up at iter N+1 with no drift.

🧭 The Shape of the Change

v2 asked: which variant is best? v3 asked: are we measuring the right thing? v3.1.0 asks the question both versions sidestepped: which variants do we want to keep around?

Greedy optimizers answer that with a single survivor. MAP-Elites answers it with a galaxy. Every cell holds a champion of its own behavioral region. Mutations fly between cells looking for thrones to take. The leaderboard never collapses to one star.

Run a loop and watch the visual at the top of the page. The cells light up one by one. Comets cross between them. The Elo column on the right reorders as the variants prove themselves against each other. That is the engine doing its job.

Evolution v3.1.0 is live.

Quality-Diversity archive. Thompson sampling. Asymmetric mutators. Variant lifecycle. Mandatory ablation. Behavioral canaries. The leaderboard is dead. Long live the galaxy.

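| Feature | v2 | v3 | v3.1 |
| --- | --- | --- | --- |
| honest scores | - | on | on |
| honest population | - | - | on |
| leaderboard | on | on | - |
| galaxy archive | - | - | on |
| resumable loops | - | - | on |
| mandatory ablation | - | - | on |
| behavioral canaries | - | - | on |
The leaderboard is dead. Long live the galaxy.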
