Two nights ago I was tossing and turning in bed, trying to sleep, when a notion came into my head, one that has returned from time to time: some of the most flow-like fun I’ve ever had was playing tabletop games. I’m a systems builder by nature, and I love to solve problems with a variety of tools. Tabletop games are complex problems you solve with a specific set of tools. My favorite is Arkham Horror LCG, though I’ve loved many more: Terraforming Mars, Ark Nova, Baseball Highlights: 2045, Core Worlds, Imperium, Labyrinth, Renegade… But none of them fully captured me. It’s as if some ideal game existed with exactly the features my brain yearns for, except that game doesn’t exist. I’ve thought, on and off, that I should create that game myself, but I never know where to start. I don’t even know exactly what I want, other than that what I’ve experienced so far isn’t enough.
These past few weeks I’ve been implementing extremely complex analytics report generators for my repository, Living Narrative Engine. I was surprised to find that it’s feasible to find gaps in extremely complex spaces (dozens of dimensions) as long as those spaces are mathematically defined. I guess Alicia was justified in her obsession with math. So I started wondering: what makes a tabletop game good? Surely, the fun you have with it. Can “fun” be mathematically defined? Is it the agency you have? The strategic depth? The variety? If metrics like those can be mathematically defined, then “fun” is a fitness score that combines them.
And what if you didn’t need to design the game yourself? If you can map a simulated game’s activity to metrics such as agency per player, strategic depth, and variety, then you can evolve a population of game definitions so that, generation after generation, the “fun” score improves. If you can turn all game mechanics into primitives, those primitives will mutate and prove their worth across the generations, composing coherent mechanics or even inventing new ones. Initially a human may need to score game definition variants according to how fun their playthroughs were, but in the end that could be automated as well.
Because this is the era of Claude Code and Codex, I’ve already implemented the first version of the app. I fed ChatGPT the architectural docs and told it to write a report; you can read it below.
LudoForge: evolving tabletop games with a deterministic “taste loop”
I’m building LudoForge, a system that tries to answer a pretty blunt question:
What if we treated tabletop game design like search—simulate thousands of candidates, kill the broken ones fast, and let a human “taste model” steer evolution toward what’s actually fun?
Under the hood, it’s a seeded-population evolution loop: you start with a set of game definitions (genomes), run simulations, extract metrics, filter degeneracy, blend in learned human preferences, and then evolve the population using MAP-Elites and genetic operators. Then you repeat.
The big picture: the loop
LudoForge is structured as a pipeline with clean seams so each layer can be tested and swapped without turning the whole thing into spaghetti. The stages look like this: seed → evaluate → simulate → analytics → (optional) human feedback → fitness → MAP-Elites → (optional) mutate/crossover/repair → next generation.
A key design choice: the core expects a seeded population. There’s no “magic generator” hidden inside that invents games from scratch. If you want a generator, you build it outside and feed it in. That keeps the engine honest and debuggable. Note by me after rereading this part of the report: this will change soon enough.
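To make the seams concrete, here's a minimal sketch of how one generation might flow through those stages. The names and types (Genome, Stages, runGeneration) are mine for illustration, not LudoForge's actual API; the point is only that each stage is an injectable, independently testable function and that seeding stays outside the engine.

```ts
// Illustrative only: these names and types are assumptions, not LudoForge's real API.
type Genome = { id: string; definition: unknown };

// Each pipeline stage is injected, so the loop itself stays a thin, testable seam.
interface Stages<Metrics> {
  validate(g: Genome): boolean;                             // schema + semantic validation
  simulate(g: Genome): { trajectory: unknown };             // one deterministic playthrough
  analyze(run: { trajectory: unknown }): Metrics;           // trajectory -> metric summary
  score(m: Metrics): number;                                // fitness: proxies + learned preference
  insertElites(pop: Genome[], fitness: number[]): Genome[]; // MAP-Elites archive update
  vary?(pop: Genome[]): Genome[];                           // optional mutate/crossover/repair
}

function runGeneration<M>(population: Genome[], stages: Stages<M>): Genome[] {
  const valid = population.filter((g) => stages.validate(g));         // reject bad DSL early
  const metrics = valid.map((g) => stages.analyze(stages.simulate(g)));
  const fitness = metrics.map((m) => stages.score(m));
  const elites = stages.insertElites(valid, fitness);
  return stages.vary ? stages.vary(elites) : elites;                  // next generation; seeding stays external
}
```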
Games as genomes: a DSL that can be validated and repaired
Each candidate game is a genome: { id, definition }, where definition is a DSL game definition. Before any evaluation happens, the definition goes through schema + semantic validation—and optionally a repair pass if you enable repair operators. Invalid DSL gets rejected before it can contaminate simulation or preference learning.
Repair is deliberately conservative: it’s mostly “DSL safety” (e.g., clamp invalid variable initial values to bounds). Anything that’s “this game is technically valid but dumb/unplayable” is handled by simulation + degeneracy detection, not by sweeping edits that hide the real problem.
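As a rough picture of what "conservative" means here, the sketch below clamps out-of-range initial variable values back into bounds and touches nothing else. The interfaces are hypothetical stand-ins, not the real DSL schema.

```ts
// Hypothetical shapes; the real DSL schema is much richer than this.
interface VariableDef { id: string; min: number; max: number; initial: number }
interface GameDefinition { variables: VariableDef[] }
interface Genome { id: string; definition: GameDefinition }

// Conservative repair: only fix what is unambiguously a DSL-safety issue,
// e.g. clamp an out-of-bounds initial value back into [min, max].
// Anything subtler is left for simulation + degeneracy detection to expose.
function repairVariables(genome: Genome): Genome {
  const variables = genome.definition.variables.map((v) => ({
    ...v,
    initial: Math.min(v.max, Math.max(v.min, v.initial)),
  }));
  return { ...genome, definition: { ...genome.definition, variables } };
}
```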
The simulation engine: deterministic playthroughs with real termination reasons
The simulation layer runs a single playthrough via runSimulation(config) (or wrapped via createSimulationEngine). It builds initial state from the definition, picks the active agent, lists legal actions, applies costs/effects/triggers, advances turns/phases, and records a trajectory of step snapshots and events.
It’s also built to fail safely:
- No legal actions → terminates as a draw with terminationReason = "stalemate".
- Max turns exceeded → terminationReason = "max-turns", with an outcome computed in that cutoff mode.
- Loop detection (optional hashing + repetition threshold) → terminationReason = "loop-detected".
Most importantly: runs are reproducible. The RNG is a seeded 32-bit LCG, so identical seeds give identical behavior.
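For reference, a 32-bit LCG is only a few lines. The sketch below uses the well-known Numerical Recipes constants, which may or may not be the ones LudoForge picked; the reproducibility property is the same either way.

```ts
// A minimal seeded 32-bit LCG (constants from Numerical Recipes); I haven't
// checked which constants LudoForge actually uses.
function createLcg(seed: number) {
  let state = seed >>> 0; // force the seed into the unsigned 32-bit range
  return {
    // Advance the state: (a * state + c) mod 2^32, using Math.imul to stay in 32-bit math.
    nextUint32(): number {
      state = (Math.imul(1664525, state) + 1013904223) >>> 0;
      return state;
    },
    // Uniform float in [0, 1)
    nextFloat(): number {
      return this.nextUint32() / 0x100000000;
    },
  };
}

// Identical seeds give identical sequences, which is what makes runs reproducible:
const a = createLcg(42);
const b = createLcg(42);
console.log(a.nextFloat() === b.nextFloat()); // true
```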
Metrics: cheap proxies first, expensive rollouts only when you ask
After simulation, LudoForge summarizes trajectories into analytics: step/turn counts, action frequencies, unique state counts, termination reasons, and sampled “key steps” that include legalActionCount.
From there it computes core metrics like these (a rough code sketch follows the list):
- Agency (fraction of steps with >1 legal action)
- Strategic depth (average legal actions per step)
- Variety (action entropy proxy)
- Pacing tension (steps per turn)
- Interaction rate (turn-taking proxy)
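To show how thin these proxies really are, here's roughly how they fall out of a trajectory of step snapshots. The Step shape and the exact formulas are my reading of the descriptions above, not LudoForge's implementation.

```ts
// Hypothetical Step shape; real trajectory snapshots carry much more than this.
interface Step { legalActionCount: number; chosenAction: string; turn: number }

// Rough versions of the core proxies (my reading of them, not the exact formulas).
function coreMetrics(steps: Step[]) {
  const n = steps.length;
  if (n === 0) return { agency: 0, strategicDepth: 0, variety: 0, pacingTension: 0 };

  // Agency: fraction of steps where the agent had a real choice (>1 legal action)
  const agency = steps.filter((s) => s.legalActionCount > 1).length / n;

  // Strategic depth: average branching factor per step
  const strategicDepth = steps.reduce((sum, s) => sum + s.legalActionCount, 0) / n;

  // Variety: normalized Shannon entropy over the chosen-action distribution
  const counts = new Map<string, number>();
  for (const s of steps) counts.set(s.chosenAction, (counts.get(s.chosenAction) ?? 0) + 1);
  let entropy = 0;
  for (const c of counts.values()) {
    const p = c / n;
    entropy -= p * Math.log2(p);
  }
  const variety = counts.size > 1 ? entropy / Math.log2(counts.size) : 0;

  // Pacing tension: steps per turn
  const turns = new Set(steps.map((s) => s.turn)).size;
  const pacingTension = n / turns;

  return { agency, strategicDepth, variety, pacingTension };
}
```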
Extended metrics exist too, and some are intentionally opt-in because they’re expensive:
- Meaningful choice spread via per-action rollouts at sampled decision points
- Comeback potential via correlation between early advantage and final outcome (one possible reading is sketched below)
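Comeback potential is the one that's easiest to picture as a formula. One plausible reading, not necessarily the exact one LudoForge uses, is the Pearson correlation between early advantage and final outcome across playthroughs, flipped so that "early leads decide everything" scores low.

```ts
// One plausible reading of "comeback potential": across many playthroughs, correlate a
// player's early advantage with their final score. If early leads always decide the game,
// the correlation is near 1 and comebacks are rare, so we invert it.
function pearson(xs: number[], ys: number[]): number {
  const n = Math.min(xs.length, ys.length);
  if (n === 0) return 0;
  const mean = (v: number[]) => v.slice(0, n).reduce((a, b) => a + b, 0) / n;
  const mx = mean(xs);
  const my = mean(ys);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return dx && dy ? num / Math.sqrt(dx * dy) : 0;
}

// Higher comebackPotential means early leads are less decisive.
function comebackPotential(earlyAdvantage: number[], finalScore: number[]): number {
  return 1 - Math.max(0, pearson(earlyAdvantage, finalScore));
}
```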
Here’s the honest stance: these metrics are not “fun”. They’re proxies. They become powerful when you combine them with learned human preference.
Degeneracy detection: kill the boring and the broken early
This is one of the parts I’m most stubborn about. Evolution will happily optimize garbage if you let it.
So LudoForge explicitly detects degeneracy patterns like:
- loops / non-termination
- stalemates
- forced-move and no-choice games
- dominant-action spam
- trivial wins
By default, those flags can reject candidates outright, and degeneracy flags also become part of the feature vector so the system can learn to avoid them even when they slip through.
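Here's a sketch of what those flags might look like as code, with made-up thresholds; the real detector is configurable, and the summary shape and field names below are guesses.

```ts
// Hypothetical summary shape and thresholds; the real detector is configurable.
interface RunSummary {
  terminationReason: "win" | "stalemate" | "max-turns" | "loop-detected";
  turnCount: number;
  agency: number;                           // fraction of steps with >1 legal action
  actionShares: Record<string, number>;     // each action's share of all choices, 0..1
}

function degeneracyFlags(run: RunSummary) {
  const shares = Object.values(run.actionShares);
  return {
    nonTerminating: run.terminationReason === "loop-detected" || run.terminationReason === "max-turns",
    stalemate: run.terminationReason === "stalemate",
    noChoice: run.agency < 0.05,                       // forced-move / no-choice games
    dominantActionSpam: shares.some((s) => s > 0.8),   // one action swamps everything else
    trivialWin: run.terminationReason === "win" && run.turnCount <= 2,
  };
}
// These flags can reject a candidate outright, and they also join the feature vector
// so the downstream model can learn to steer away from near-misses.
```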
Human feedback: turning taste into a model
Metrics get you a feature vector. Humans supply the missing ingredient: taste.
LudoForge supports two feedback modes:
- Ratings (1–5) with optional tags and rationale
- Pairwise comparisons (A/B/Tie) with optional tags and rationale
Pairwise comparisons are the main signal: they’re cleaner than ratings and train a preference model using a logistic/Bradley–Terry style update. Ratings still matter, but they’re weighted lower by default.
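Here's a sketch of what a logistic/Bradley–Terry-style update looks like over metric-id-keyed feature vectors; the learning rate, the weight storage, and the exact loss are assumptions on my part.

```ts
// A minimal Bradley–Terry-style logistic update; names and hyperparameters are illustrative.
type Features = Record<string, number>; // keyed by metric id, not position

const sigmoid = (x: number) => 1 / (1 + Math.exp(-x));

const score = (w: Features, f: Features) =>
  Object.keys(f).reduce((s, k) => s + (w[k] ?? 0) * f[k], 0);

// One gradient step on a single comparison: "a was preferred over b".
function updatePreference(w: Features, a: Features, b: Features, lr = 0.1): Features {
  // P(a beats b) under the current model
  const p = sigmoid(score(w, a) - score(w, b));
  // The logistic-loss gradient pushes weight toward features where a exceeds b
  const next: Features = { ...w };
  for (const k of Object.keys({ ...a, ...b })) {
    next[k] = (next[k] ?? 0) + lr * (1 - p) * ((a[k] ?? 0) - (b[k] ?? 0));
  }
  return next;
}
```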
There’s also active learning: it selects comparison pairs where the model is most uncertain (predicted preference closest to 0.5), while reserving slots to ensure underrepresented MAP-Elites niches get surfaced. That keeps your feedback from collapsing into “I only ever see one genre of game.”
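The uncertainty half of that selection is simple to picture. A minimal sketch (ignoring the niche-reservation slots) just ranks candidate pairs by how close the model's prediction is to a coin flip; the pair shape is hypothetical.

```ts
// Uncertainty sampling for pairwise feedback: prefer pairs where the model is least sure.
interface CandidatePair<G> { a: G; b: G; predictedPreference: number } // P(a beats b)

function mostUncertainPairs<G>(pairs: CandidatePair<G>[], count: number): CandidatePair<G>[] {
  return [...pairs]
    .sort((p, q) => Math.abs(p.predictedPreference - 0.5) - Math.abs(q.predictedPreference - 0.5))
    .slice(0, count);
}
```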
Fitness: blending objective proxies, diversity pressure, and learned preference
Fitness isn’t a single magic number pulled from the void. It’s a blend:
- Base composite score from metrics (weighted sum/objectives)
- Diversity contribution (pressure toward exploring niches)
- Preference contribution from the learned preference model (centered/capped, with bootstrap limits early on)
Feature vectors are keyed by metric id (not positional arrays), which matters a lot: adding a new metric doesn’t silently scramble your model weights. Renaming metrics, though, becomes a migration event (and that’s correct—you should feel that pain explicitly).
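Here's roughly how that blend might read in code, with feature vectors keyed by metric id. The weighting scheme, the centering, and the bootstrap cap are stand-ins, not the actual implementation.

```ts
// Sketch of the fitness blend; field names and weighting are assumptions.
type Features = Record<string, number>; // keyed by metric id

interface FitnessParts {
  baseWeights: Features;    // objective proxy weights
  diversityBonus: number;   // pressure toward under-filled niches
  preferenceScore: number;  // learned preference model output, already centered
  preferenceCap: number;    // bootstrap limit while the model has little data
}

function blendFitness(features: Features, parts: FitnessParts): number {
  // Base composite score: weighted sum over metric ids (unknown ids simply contribute 0)
  const base = Object.keys(features).reduce(
    (sum, id) => sum + (parts.baseWeights[id] ?? 0) * features[id],
    0,
  );
  // Cap the preference contribution so an under-trained model can't dominate early on
  const preference = Math.max(-parts.preferenceCap, Math.min(parts.preferenceCap, parts.preferenceScore));
  return base + parts.diversityBonus + preference;
}
```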
Evolution: MAP-Elites + mutation/crossover that respect DSL validity
Instead of selecting “top N” and converging into a monoculture, LudoForge uses MAP-Elites: it bins candidates into descriptor niches and keeps the best elite per niche.
Descriptor binning is explicit and deterministic (normalize → floor into bin count; clamp to range), and niche ids serialize coordinates like descriptorId:bin|....
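That binning rule is small enough to show in full. A sketch under the assumption that descriptors carry an explicit range and bin count (the field names are mine):

```ts
// Deterministic binning: normalize into [0, 1], floor into binCount, clamp to a valid bin.
interface Descriptor { id: string; min: number; max: number; binCount: number }

function binFor(value: number, d: Descriptor): number {
  const normalized = (value - d.min) / (d.max - d.min || 1);
  const bin = Math.floor(normalized * d.binCount);
  return Math.max(0, Math.min(d.binCount - 1, bin)); // clamp out-of-range values
}

// Serialize niche coordinates, e.g. "agency:3|variety:1"
function nicheId(values: Record<string, number>, descriptors: Descriptor[]): string {
  return descriptors.map((d) => `${d.id}:${binFor(values[d.id], d)}`).join("|");
}
```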
Then you can evolve elites with genetic operators:
- Mutations like numeric tweaks, boolean toggles, enum cycling, duplicating/removing actions, nudging effect magnitudes, adding/removing phases, rewriting token/zone references safely, etc.
- Crossover via subtree swaps of state.variables or actions, followed by DSL re-validation.
Optional “shortlisting” exists too: it picks a diversified subset of elites for human review using a max-min distance heuristic over descriptor coordinates.
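A max–min distance heuristic is usually implemented as greedy farthest-point selection. The sketch below assumes elites carry normalized descriptor coordinates; it's my reading of the idea, not necessarily the exact routine.

```ts
// Greedy farthest-point selection over descriptor coordinates (illustrative shapes).
interface Elite { id: string; coords: number[] }

function shortlist(elites: Elite[], count: number): Elite[] {
  if (elites.length === 0 || count <= 0) return [];
  const dist = (a: Elite, b: Elite) =>
    Math.sqrt(a.coords.reduce((s, v, i) => s + (v - b.coords[i]) ** 2, 0));

  const picked: Elite[] = [elites[0]]; // start arbitrarily (or from the best elite)
  while (picked.length < Math.min(count, elites.length)) {
    let best: Elite | undefined;
    let bestMinDist = -Infinity;
    for (const e of elites) {
      if (picked.includes(e)) continue;
      // Distance to the nearest already-picked elite; maximize it to spread the shortlist out
      const minDist = Math.min(...picked.map((p) => dist(e, p)));
      if (minDist > bestMinDist) {
        bestMinDist = minDist;
        best = e;
      }
    }
    picked.push(best!);
  }
  return picked;
}
```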
What’s already proven (and what isn’t yet)
This isn’t vaporware; the end-to-end tests already prove key behaviors like:
- the ordered phases of the pipeline
- invalid DSL rejection before evaluation
- safety cutoffs (max-turns) and deterministic seeded outputs
- human prompt loops and legality enforcement
- deterministic state transitions
- MAP-Elites producing stable ids
- active learning selection behavior
- mutation + repair at scale, including crossover
And there are explicitly documented gaps—like extended metrics aggregation and worker-thread batch simulations.
The point of LudoForge
I’m not trying to build a “game designer replacement.” I’m building a design pressure cooker:
- Simulate hard
- Reject degeneracy ruthlessly
- Measure what you can
- Ask humans the right questions
- Let evolution explore breadth, not just a single hill
If you’re into procedural design, evolutionary search, or just enjoy the idea of treating “fun” as something you can iteratively approximate with a human-in-the-loop model, that’s what this project is for.