LudoForge #4

Now that the evolutionary process that grows game definitions in my app, LudoForge, is progressing at a steady pace, I fed the architectural docs to ChatGPT and asked it to write a clear explanation of how it all works. It may be interesting to anyone curious about how complex systems can grow organically through an algorithm that mimics biological evolution.


Teaching a computer to invent tabletop games (and occasionally rediscover the classics)

If you squint, evolving game designs is a lot like evolving creatures: you start with a messy ecosystem of “mostly viable” little organisms, you test which ones can survive in their environment, you keep the best survivors—but you also keep a diverse set of survivors so the whole population doesn’t collapse into one boring species.

In our system, the “organisms” are game definitions written in a small, strict game DSL (a structured way to describe rules: players, state variables, actions, effects, win/lose conditions, turn order, and so on). Each candidate game definition is wrapped in a genome: basically an ID plus the full definition.
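
To make that concrete, here’s a minimal TypeScript sketch of what a genome and its definition might look like. Every field name here is a hypothetical simplification; the real DSL is richer:

```ts
// Hypothetical shapes for illustration only; the real DSL has many more fields.
type Effect = { kind: "add" | "set"; variable: string; amount: number };
type Condition = { variable: string; op: ">=" | "<=" | "=="; value: number };

interface GameDefinition {
  players: { min: number; max: number };
  variables: Record<string, { init: number; min?: number; max?: number }>;
  actions: { id: string; cost?: Record<string, number>; effects: Effect[] }[];
  winConditions: Condition[];
  loseConditions: Condition[];
  turnOrder: "round-robin" | "simultaneous" | "priority";
}

interface Genome {
  id: string;                 // unique identifier for this candidate
  definition: GameDefinition; // the full ruleset it carries
}
```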

From there, the evolutionary loop repeats: seed → simulate → score → place into niches → mutate elites → repeat.
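
In sketch form, the outer loop looks like this; every declaration below is a stand-in for a whole subsystem covered in the sections that follow:

```ts
// Stand-in outer loop; each declared function/class is a placeholder subsystem.
type Genome = { id: string };
type Score = { fitness: number; descriptors: number[] };

declare function seedPopulation(): Genome[];                        // section 1
declare function simulate(g: Genome): unknown;                      // section 2
declare function evaluate(g: Genome, trace: unknown): Score | null; // section 3
declare function mutate(g: Genome): Genome;                         // section 5
declare class MapElitesArchive {                                    // section 4
  tryInsert(g: Genome, s: Score): boolean;
  elites(): Genome[];
}

function evolve(generations: number): MapElitesArchive {
  let population = seedPopulation();
  const archive = new MapElitesArchive();
  for (let gen = 0; gen < generations; gen++) {
    for (const genome of population) {
      const score = evaluate(genome, simulate(genome));
      if (score) archive.tryInsert(genome, score); // compete within a niche
    }
    population = archive.elites().map(mutate);     // breed from the survivors
  }
  return archive;
}
```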

1) Seeding: where the first games come from

Evolution needs a starting population. We can generate seeds automatically, or import them from disk, or mix both approaches. The important bit isn’t “are the seeds good?”—it’s “are they valid and diverse enough to start exploring?”

So seeds must pass two kinds of checks before they’re even allowed into the ecosystem:

  • Schema validation: does the JSON structure match the DSL’s required shape?
  • Semantic validation: does it make sense as a playable ruleset (no broken references, impossible requirements, etc.)?

And there’s a third, subtle filter: when we place games into “niches” (more on that below), seeds that land only in junk bins like unknown/under/over are rejected during seed generation, because they don’t help us cover the design space.
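
Putting the three filters together, seed acceptance might look something like this sketch, where the niche names mirror the junk bins above and everything else is hypothetical:

```ts
// Hypothetical helper signatures; each stands in for real validation logic.
interface GameDefinition { actions: unknown[] }                 // elided
declare function matchesSchema(c: unknown): boolean;            // JSON shape
declare function semanticallyValid(d: GameDefinition): boolean; // no broken refs
declare function nichesFor(d: GameDefinition): string[];        // descriptor bins

const JUNK_NICHES = new Set(["unknown", "under", "over"]);

function acceptSeed(candidate: unknown): boolean {
  if (!matchesSchema(candidate)) return false;          // schema validation
  const def = candidate as GameDefinition;
  if (!semanticallyValid(def)) return false;            // semantic validation
  // Third filter: the seed must land in at least one non-junk niche,
  // otherwise it adds nothing to design-space coverage.
  return nichesFor(def).some((n) => !JUNK_NICHES.has(n));
}
```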

Think of this as: we don’t just want “a bunch of seeds,” we want seeds scattered across different climates so evolution has many directions to run in.

2) Playtesting at machine speed: simulation as the “environment”

A human can’t playtest 10,000 games a day. A simulation engine can.

For every candidate game, we run automated playthroughs using AI agents (simple ones like random and greedy are enough to expose lots of structural problems). The engine repeatedly:

  1. Lists the legal moves available right now
  2. Checks termination (win/lose/draw conditions, cutoffs, loop detection)
  3. Lets an agent pick an action (with concrete targets if needed)
  4. Applies costs and effects, recording what happened
  5. Advances the turn/phase according to the game’s turn scheduler

Crucially: when a complex game definition tries to do something illegal (like decrementing below a minimum, or targeting something that doesn’t exist), the engine records skipped effects/triggers instead of crashing, so the system can observe “this design is broken in these ways” rather than just failing outright.
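
One playthrough’s inner loop, including that record-don’t-crash behavior, might look roughly like this. All names are illustrative, and modeling skipped effects as a return value is just one plausible design:

```ts
// Illustrative inner loop of one playthrough; every name is a stand-in.
interface SimState { turn: number; skipped: string[] }
interface Move { actionId: string; targets?: string[] }
interface Agent { pick(moves: Move[], state: SimState): Move }

declare function legalMoves(state: SimState): Move[];   // 1. enumerate moves
declare function terminal(state: SimState): boolean;    // 2. check termination
declare function applyMove(                             // 4. costs + effects
  state: SimState,
  move: Move
): { skippedEffects: string[] };
declare function advanceTurn(state: SimState): void;    // 5. turn scheduler

function playthrough(state: SimState, agents: Agent[], maxSteps: number) {
  for (let step = 0; step < maxSteps && !terminal(state); step++) {
    const moves = legalMoves(state);
    const agent = agents[state.turn % agents.length];
    const move = agent.pick(moves, state);              // 3. agent chooses
    const { skippedEffects } = applyMove(state, move);
    // Illegal effects (e.g. decrementing below a minimum) are recorded,
    // not thrown: broken designs leave observable tracks.
    state.skipped.push(...skippedEffects);
    advanceTurn(state);
  }
  return state; // the trace, including skipped effects, feeds the analytics
}
```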

This is the equivalent of an organism interacting with the world and leaving tracks: “it tried to fly, but its wings didn’t work.”

3) Turning playthroughs into numbers: metrics, degeneracy, and fitness

After simulation, we compute a set of analytics that act like proxies for design quality—things like:

  • Agency: did players have meaningful choices, or were they railroaded?
  • Strategic depth (proxy): how large is the typical decision space?
  • Variety: do many different actions get used, or does one dominate?
  • Interaction rate: are players affecting each other or only themselves?
  • Structural complexity: is this a tiny toy, or something richer?

These are not “fun detectors.” They’re sensors.
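
As one concrete example of such a sensor, variety can be approximated as the normalized entropy of action usage across playthroughs. This is a plausible formulation, not necessarily the exact one the system uses:

```ts
// Variety as normalized Shannon entropy of action usage counts.
// 0 means one action dominates completely; 1 means perfectly even usage.
function varietyScore(actionCounts: Map<string, number>): number {
  const total = [...actionCounts.values()].reduce((a, b) => a + b, 0);
  if (total === 0 || actionCounts.size < 2) return 0;
  let entropy = 0;
  for (const count of actionCounts.values()) {
    if (count === 0) continue;
    const p = count / total;
    entropy -= p * Math.log2(p);
  }
  return entropy / Math.log2(actionCounts.size); // normalize to [0, 1]
}

// Example: one dominant action scores low variety (about 0.21 here).
console.log(varietyScore(new Map([["attack", 95], ["trade", 3], ["pass", 2]])));
```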

Then we run degeneracy detection: filters that catch the classic failure modes of randomly generated rulesets:

  • infinite loops / repeated states
  • non-terminating games (hits max turns/steps)
  • games with no real choices
  • trivial wins, dominant actions, excessive skipped effects, etc.

Some degeneracy flags cause an outright reject (hard gate), others apply a penalty (soft pressure), and too many penalties at once can also trigger rejection.

Finally, all of this becomes a feature vector, and we compute an overall fitness score—the number evolution tries to increase.
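
Here’s a sketch of how hard gates, soft penalties, and the feature vector could combine into that single number. The weights and thresholds are invented for illustration:

```ts
// Invented weights/thresholds; the shape of the computation is what matters.
interface Flags { infiniteLoop: boolean; nonTerminating: boolean; penalties: string[] }
interface Features { agency: number; depth: number; variety: number; interaction: number }

const MAX_PENALTIES = 3;     // too many soft flags is itself a rejection
const PENALTY_WEIGHT = 0.15;

function fitness(flags: Flags, f: Features): number | null {
  // Hard gates: outright rejection.
  if (flags.infiniteLoop || flags.nonTerminating) return null;
  if (flags.penalties.length > MAX_PENALTIES) return null;
  // Weighted sum over the feature vector, minus soft pressure.
  const base =
    0.35 * f.agency + 0.25 * f.depth + 0.2 * f.variety + 0.2 * f.interaction;
  return Math.max(0, base - PENALTY_WEIGHT * flags.penalties.length);
}
```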

4) “Growing in niches”: why we don’t keep only the top 1%

If you only keep the single highest-scoring game each generation, you get premature convergence: the population collapses into one design family and stops surprising you.

Instead, we use MAP-Elites, which you can picture as a big grid of “design neighborhoods.” Each neighborhood is defined by a few chosen descriptors (think: agency bucket, variety bucket, etc.). Each candidate game gets “binned” into a niche based on its descriptor values, and then it competes only with others in that same niche.

  • Each niche keeps its best resident (the “elite”).
  • Over time, the map fills with many different elites: fast games, slow games, chaotic games, skillful games, high-interaction games, and so on.

This is how you get a museum of interesting survivors, not one monoculture.
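
A toy version of the archive fits in a few lines: bucket each descriptor into a fixed number of bins, use the bucket tuple as the niche key, and let the best fitness win each niche. This is a sketch, not the production data structure:

```ts
// Toy MAP-Elites archive: one elite per niche, keyed by bucketed descriptors.
interface Scored { genome: { id: string }; fitness: number; descriptors: number[] }

class MapElitesArchive {
  private niches = new Map<string, Scored>();
  constructor(private bins = 10) {}

  private key(descriptors: number[]): string {
    // Bucket each descriptor (assumed in [0, 1]) into `bins` cells.
    return descriptors
      .map((d) => Math.min(this.bins - 1, Math.floor(d * this.bins)))
      .join(",");
  }

  tryInsert(candidate: Scored): boolean {
    const k = this.key(candidate.descriptors);
    const incumbent = this.niches.get(k);
    if (!incumbent || candidate.fitness > incumbent.fitness) {
      this.niches.set(k, candidate); // new elite for this neighborhood
      return true;
    }
    return false; // lost to the current resident
  }

  elites(): Scored[] {
    return [...this.niches.values()];
  }
}
```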

5) Reproduction: mutation (and why mutation is structured, not random noise)

Once we have elites across niches, we generate the next generation by mutating them.

Mutation operators aren’t “flip random bits.” They are rule-aware edits that make plausible changes to game structure, such as:

  • tweak a number (thresholds, magnitudes)
  • add/remove/duplicate actions (with safety guards so you can’t delete the last action)
  • add/remove variables (while rewriting dangling references)
  • change turn schedulers (round-robin ↔ simultaneous ↔ priority-based, etc.)
  • add triggers and conditional effects
  • modify termination rules (win/lose/draw conditions)

The key is: the operator library is rich enough to explore mechanics, not just parameters.
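
Two of those operators in sketch form, including the safety guard that refuses to delete the last action (the definition shape here is hypothetical):

```ts
// Sketches of two rule-aware operators; the definition shape is hypothetical.
interface GameDefinition {
  actions: { id: string; threshold: number }[];
}

// Tweak a number: nudge a threshold by a small, bounded amount.
function tweakNumber(def: GameDefinition, rng: () => number): GameDefinition {
  const actions = def.actions.map((a) => ({ ...a }));
  const target = actions[Math.floor(rng() * actions.length)];
  target.threshold = Math.max(0, target.threshold + (rng() < 0.5 ? -1 : 1));
  return { ...def, actions };
}

// Remove an action, with a safety guard: never delete the last one.
function removeAction(def: GameDefinition, rng: () => number): GameDefinition | null {
  if (def.actions.length <= 1) return null; // guard: would break the game
  const idx = Math.floor(rng() * def.actions.length);
  return { ...def, actions: def.actions.filter((_, i) => i !== idx) };
}
```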

Mutation retries (because many mutations are duds)

Some mutations do nothing (“no-op”), or produce something that can’t be repaired. The runner will retry with a different operator a few times; if it still can’t produce a productive mutation, it falls back to keeping the parent for that offspring slot.

This keeps evolution moving without pretending every random change is meaningful.
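
The retry-with-fallback behavior is easy to sketch:

```ts
// Sketch of mutation retries: try a few operators, fall back to the parent.
interface Genome { id: string }
type Operator = (parent: Genome) => Genome | null; // null = no-op/unrepairable

function mutateWithRetries(
  parent: Genome,
  pickOperator: () => Operator,
  maxTries = 5
): Genome {
  for (let attempt = 0; attempt < maxTries; attempt++) {
    const child = pickOperator()(parent);
    if (child !== null) return child;   // a productive mutation
  }
  return parent; // fallback: keep the parent for this offspring slot
}
```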

6) Repair, rejection reasons, and staying honest about failure

After mutation, we optionally run repair, then validation and safety gates. If a candidate fails, it isn’t just dropped silently; the system classifies why it failed:

  • repair failure
  • validation failure
  • safety failure
  • evaluation error
  • evaluation returned null fitness/descriptors

And it persists these outcomes for observability and debugging.
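
One natural way to model those outcomes is a discriminated union persisted alongside each attempt; the field names here are invented:

```ts
// Invented shape for persisted rejection outcomes.
type Rejection =
  | { kind: "repair_failure"; detail: string }
  | { kind: "validation_failure"; errors: string[] }
  | { kind: "safety_failure"; gate: string }
  | { kind: "evaluation_error"; message: string }
  | { kind: "null_evaluation" };        // fitness/descriptors came back null

interface AttemptRecord {
  genomeId: string;
  generation: number;
  outcome: { kind: "accepted" } | Rejection;
}

// Persisting every AttemptRecord lets you chart, per generation, which
// failure modes dominate: a health dashboard for the ecosystem.
```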

This matters because “evolution” is only useful if you can tell whether the ecosystem is healthy—or if you’ve started breeding nonsense.

7) Adaptive evolution: learning which mutations are actually useful

Not all mutation operators are created equal. Some will be reliably productive; others will mostly create broken genomes.

So the runner tracks per-operator telemetry every generation: attempts, no-ops, repair failures, rejection counts, and how often an operator actually helped fill a new niche or improve an elite.

Those stats feed into adaptive operator weighting, so the system gradually shifts its mutation choices toward what’s working in the current region of the design space—without hardcoding that “operator X is always good.”
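
A plausible shape for that weighting (not necessarily the runner’s exact scheme) is smoothed success-rate sampling:

```ts
// Sketch: operators are sampled proportionally to their smoothed success rate.
interface OperatorStats { attempts: number; successes: number } // niche fills, elite improvements

function operatorWeights(stats: Map<string, OperatorStats>): Map<string, number> {
  const weights = new Map<string, number>();
  for (const [name, s] of stats) {
    // Laplace smoothing keeps rarely tried operators from being starved.
    weights.set(name, (s.successes + 1) / (s.attempts + 2));
  }
  return weights;
}

function sampleOperator(weights: Map<string, number>, rng: () => number): string {
  const total = [...weights.values()].reduce((a, b) => a + b, 0);
  let r = rng() * total;
  for (const [name, w] of weights) {
    r -= w;
    if (r <= 0) return name;
  }
  return [...weights.keys()][weights.size - 1]; // numeric edge case
}
```

The smoothing matters: it keeps an operator that hasn’t been tried much from being written off before it has had a fair shot in the current region of the design space.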

8) Optional superpower: motif mining (stealing good patterns from winners)

Sometimes an evolved game contains a little “mechanical phrase” that’s doing real work—like a neat resource exchange loop, or a repeating pattern of effects that creates tension.

When motif mining is enabled, we:

  1. select elites (top per niche + global best)
  2. re-simulate them to extract trajectories
  3. mine repeated effect sequences (“motifs”)
  4. convert those motifs back into DSL effects
  5. feed them into a special mutation operator that can inject those motifs into new games

That’s evolution discovering a useful mechanic, then turning it into reusable genetic material.
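
The heart of step 3, mining repeated effect sequences, can be sketched as n-gram counting over effect traces. This is a simplification of whatever the real miner does:

```ts
// Sketch: treat each playthrough as a sequence of effect labels and
// count n-grams; frequent ones are candidate "motifs".
function mineMotifs(traces: string[][], n = 3, minCount = 5): string[][] {
  const counts = new Map<string, number>();
  for (const trace of traces) {
    for (let i = 0; i + n <= trace.length; i++) {
      const key = trace.slice(i, i + n).join("|");
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, c]) => c >= minCount)
    .sort((a, b) => b[1] - a[1])
    .map(([key]) => key.split("|")); // each motif is a reusable effect sequence
}

// Example: a "gain gold -> buy card -> draw" loop showing up across elites.
```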

9) Human taste enters the loop (without turning it into manual curation)

Metrics are helpful, but “fun” is subjective. So we can add human feedback:

  • ratings (“how good is this?”)
  • pairwise comparisons (“A or B?”)

Rather than asking the human to judge random games, the system uses active learning to choose the most informative comparisons, especially cases where its preference model is uncertain. It also tries to include underrepresented niches so taste is learned across the whole map.

Under the hood, the preference model is an ensemble trained online (so it can update continuously) and its uncertainty controls how much feedback to request per generation (adaptive budget).
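
A sketch of the uncertainty-driven part: score both games with every ensemble member, then prefer the pairs the members disagree about most. The model interface here is a plausible shape, not the actual one:

```ts
// Sketch: pick the comparison pair the preference ensemble is least sure about.
interface PrefModel { score(gameId: string): number } // higher = preferred

function disagreement(ensemble: PrefModel[], a: string, b: string): number {
  // Fraction of members preferring `a`; 0.5 means a split vote.
  const prefA =
    ensemble.filter((m) => m.score(a) > m.score(b)).length / ensemble.length;
  return 1 - Math.abs(prefA - 0.5) * 2; // 1 = maximal disagreement, 0 = unanimous
}

function mostInformativePair(
  ensemble: PrefModel[],
  candidatePairs: [string, string][]
): [string, string] {
  return candidatePairs.reduce((best, pair) =>
    disagreement(ensemble, ...pair) > disagreement(ensemble, ...best) ? pair : best
  );
}
```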

Fitness can then blend:

  • “objective-ish” signals (metrics/degeneracy)
  • “human preference” signals

So the system doesn’t just breed games that are non-broken—it breeds games that align with what you actually enjoy.
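
The blend itself can be as simple as a weighted average, with the weight as a tuning knob (a sketch, not the actual formula):

```ts
// Sketch: blend objective metrics with learned human preference.
function blendedFitness(
  objective: number,   // metrics + degeneracy signals, in [0, 1]
  preference: number,  // preference-model output, in [0, 1]
  humanWeight = 0.3    // hypothetical knob: how much taste matters
): number {
  return (1 - humanWeight) * objective + humanWeight * preference;
}
```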

10) Why this can rediscover known games and invent new ones

If your DSL is expressive enough to describe the rules of an existing game, then in principle there exists a genome that encodes it. Evolution doesn’t need to “know” the game—it only needs:

  1. Search operators that can reach that region of the rulespace (structural mutations, not just numeric tweaks)
  2. Selection pressure that rewards the behaviors that make that game work (choice, balance, interaction, clean termination, etc.)
  3. Diversity preservation so the system keeps exploring many styles instead of collapsing early

Once those are in place, rediscovery becomes a side-effect of searching a huge space under the right constraints. And invention is what happens when evolution stumbles into combinations nobody tried on purpose—then keeps them because the ecosystem rewards them.

The simplest mental model

If you want the non-technical version in one breath:

We generate lots of small rulesets, machine-play them thousands of times, score them for “does this behave like a real game?”, sort the survivors into many different “design neighborhoods,” keep the best in each neighborhood, then make mutated children from those survivors—occasionally learning from human taste and reusing discovered mechanics—until the map fills with strong and varied games.

That’s the evolutionary process: not magic, not random, but a relentless loop of variation + playtesting + selection + diversity.