LLMs playing against each other in a 1v1 game (round-robin tournament)

The game is simple: 9 units vs. 9 units fighting each other on a game board with obstacles and healing pods.

■ P1 units □ P2 units ○ Barricade + Pod

The LLMs iterate on their code using an ASCII representation of the game state. Players write JavaScript code that controls the units and accesses game info. The only actions units can take are move() and pew(). All of the complexity emerges from having to reason about where to move and whom to shoot (pew).

Primitive player code example:

```javascript
function dist(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

function closest_enemy(cat) {
  let best = null;
  let best_dist = Infinity;
  for (let id of cat.sight.enemies) {
    let d = dist(cat.position, cats[id].position);
    if (d < best_dist) {
      best_dist = d;
      best = cats[id];
    }
  }
  return best;
}

for (let cat of my_cats) {
  let enemy = closest_enemy(cat);
  if (enemy) { // sight.enemies may be empty, in which case closest_enemy returns null
    cat.move(enemy.position);
    cat.pew(enemy);
  }
}
```

Try the game yourself →

Unit properties:
- id: string
- position: array
- energy_capacity: number
- energy: number
- hp: number
- sight: object
- …

Testing method

Each model iterates 10 times against a reference bot (Clowder Bot): writing code, playing a game, then reviewing the replay (ASCII board snapshots plus its own logs) before trying again. The resulting bots then compete in a 10-game round-robin tournament, with the same write → play → review loop between games.

Results

Highlights:
- Gemini 3.1 Pro dominated, comfortably beating all other LLMs and dropping only 4 games across 50 played.
- Claude Sonnet 4.6 surprisingly outperformed Opus 4.6 across every matchup format we tested.
- GPT-5.3 Codex showed strong improvement over many games, climbing above both Opus and GPT-5.4 in the 10-game format.

Discuss these results, suggest improvements to the testing methodology, or propose other models you would like to see covered in our Discord.
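For reference, here is a slightly less naive targeting policy than the primitive example above, written against the same globals (`my_cats`, `cats`, `cat.sight.enemies`) and the documented unit properties (`hp`, `position`). This is only a sketch: everything beyond `move()`, `pew()`, and the listed properties is an assumption about the game, not the project's documented behavior.

```javascript
// Sketch only: assumes the globals from the primitive example (cats, my_cats)
// and the documented unit properties (hp, position, sight).

// Euclidean distance between two [x, y] positions.
function dist(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Focus fire: among visible enemies, prefer the lowest-hp one,
// breaking ties by distance. Returns null when nothing is in sight.
function pickTarget(cat, allCats) {
  let best = null;
  for (const id of cat.sight.enemies) {
    const e = allCats[id];
    if (best === null ||
        e.hp < best.hp ||
        (e.hp === best.hp &&
         dist(cat.position, e.position) < dist(cat.position, best.position))) {
      best = e;
    }
  }
  return best;
}

// Per-turn loop, mirroring the primitive example (commented out because
// my_cats/cats only exist inside the game sandbox):
// for (const cat of my_cats) {
//   const target = pickTarget(cat, cats);
//   if (target) {
//     cat.move(target.position);
//     cat.pew(target);
//   }
// }
```

Focus fire tends to beat nearest-enemy targeting in this kind of game because finishing off a wounded unit removes its damage output entirely, while spreading shots leaves every enemy alive and shooting.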
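The write → play → review loop from the testing method can be pictured as a small harness. All names here (`runIterations`, `agent.write`, `playGame`) are hypothetical; the page describes the loop, not its implementation.

```javascript
// Hypothetical harness for the write → play → review loop described above.
// The function names are illustrative, not the project's actual API.
function runIterations(agent, playGame, rounds) {
  let code = agent.write(null); // first attempt, no replay to review yet
  const history = [];
  for (let i = 0; i < rounds; i++) {
    const replay = playGame(code); // yields ASCII board snapshots + the bot's own logs
    history.push({ code, replay });
    code = agent.write(replay);    // revise the bot after reviewing the replay
  }
  return history;
}
```

With `rounds = 10` this matches the per-model iteration count against Clowder Bot; the same loop then runs between tournament games.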