Scaffolding to Superhuman: How Curriculum Learning Solved 2048 and Tetris

https://news.ycombinator.com/rss Hits: 19
Summary

Training gaming agents is an addictive game: a game of sleepless nights, grinds, explorations, sweeps, and prayers. PufferLib lets anyone with a gaming computer play the RL game, but getting from “pretty good” to “superhuman” requires tweaking every lever, repeatedly. This is the story of how I trained agents that beat massive (few-TB) search-based solutions on 2048 using a 15MB policy trained for 75 minutes, and discovered that bugs can be features in Tetris. TLDR? PufferLib, Pareto sweeps, and curriculum learning.

Speed and Iteration

PufferLib’s C-based environments run at 1M+ steps per second per CPU core, fast enough to solve Breakout in under one minute. It also comes with advanced RL upgrades like optimized vectorized environments, LSTM, Muon, and Protein, a cost-aware hyperparameter sweep framework. Since a 1B-step training run takes minutes, RL transforms from “YOLO and pray” into systematic search, enabling hundreds of hyperparameter sweeps in hours rather than days. All training ran on two high-end gaming desktops, each with a single RTX 4090. Compute was sponsored by Puffer.ai, thanks!

The Recipe

Augment observations: Give the policy the information it needs.
Tweak rewards: Shape the learning signal and adjust weights.
Design curriculum: Control what the agent experiences and when.

Network scaling comes last. Only after exhausting observations, rewards, and curriculum should you scale up. Larger networks make training slower; nail the obs and reward first. Once you do scale, the increased capacity may (or may not) reach new heights and reveal new insights, kicking off a fresh iteration cycle.

Sweep methodology: I ran 200 sweeps, starting broad and narrowing to fine-tune. Protein samples from the cost-outcome Pareto front, using small experiments to find optimal hyperparameters before committing to longer runs.

2048: Beating the Few-TB Endgame Table

2048 strikes a unique balance between simplicity and complexity. The rules are trivial: merge the same tiles to reach...
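The curriculum step in the recipe above, controlling what the agent experiences and when, is commonly implemented as an environment wrapper that graduates to a harder setting once the agent's recent success rate clears a threshold. Below is a minimal sketch of that idea; the `difficulty` reset option and the `success` info key are assumptions for illustration, not PufferLib's API or the post's actual setup.

```python
import gymnasium as gym

class SuccessGatedCurriculum(gym.Wrapper):
    """Raise a difficulty knob once the recent success rate clears a threshold.

    Illustrative only: assumes the wrapped env accepts a `difficulty` reset
    option and reports a boolean `success` key in `info` at episode end.
    """

    def __init__(self, env, levels, window=100, threshold=0.8):
        super().__init__(env)
        self.levels = levels        # ordered difficulty settings, easy -> hard
        self.level = 0              # index of the current curriculum stage
        self.window = window        # episodes used to estimate success rate
        self.threshold = threshold  # success rate required to graduate
        self.results = []

    def reset(self, **kwargs):
        options = kwargs.get("options") or {}
        options["difficulty"] = self.levels[self.level]
        kwargs["options"] = options
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated or truncated:
            self.results.append(1.0 if info.get("success") else 0.0)
            self.results = self.results[-self.window:]
            full = len(self.results) == self.window
            passed = full and sum(self.results) / self.window >= self.threshold
            if passed and self.level < len(self.levels) - 1:
                self.level += 1     # graduate to the next stage
                self.results = []   # restart the success estimate at the new stage
        return obs, reward, terminated, truncated, info
```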
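The sweep methodology above describes Protein as sampling from the cost-outcome Pareto front. Here is a rough sketch of that selection criterion, assuming each finished run is summarized by a cost (e.g. wall-clock minutes) and a score (e.g. mean return); the `Run` type and `pareto_front` helper are illustrative, not Protein's actual implementation or API.

```python
from dataclasses import dataclass

@dataclass
class Run:
    params: dict      # hyperparameters used for this run
    cost: float       # e.g. wall-clock minutes or env steps
    score: float      # e.g. final mean episode return

def pareto_front(runs: list[Run]) -> list[Run]:
    """Keep runs that no other run beats on both cost (lower) and score (higher)."""
    front = []
    for r in runs:
        dominated = any(
            (o.cost <= r.cost and o.score >= r.score)
            and (o.cost < r.cost or o.score > r.score)
            for o in runs
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda run: run.cost)

# Usage: run cheap experiments first, inspect the front, then spend the
# long-run budget only on configurations that sit on (or near) the front.
runs = [
    Run({"lr": 3e-4}, cost=5.0, score=120.0),
    Run({"lr": 1e-3}, cost=5.0, score=90.0),
    Run({"lr": 3e-4, "batch": 4096}, cost=30.0, score=180.0),
]
for r in pareto_front(runs):
    print(r.cost, r.score, r.params)
```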
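For readers unfamiliar with the merge rule mentioned at the end of the summary, this is a small, self-contained sketch of how a single 2048 row resolves on a left move (each pair of equal tiles merges at most once per move); it is not the environment code from the post.

```python
def merge_row_left(row: list[int]) -> tuple[list[int], int]:
    """Slide a 2048 row left, merging equal adjacent tiles once; return the new row and score gained."""
    tiles = [t for t in row if t != 0]   # drop empty cells
    merged, score, i = [], 0, 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # merge the pair into one doubled tile
            score += tiles[i] * 2
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged)), score

assert merge_row_left([2, 2, 4, 0]) == ([4, 4, 0, 0], 4)
assert merge_row_left([2, 2, 2, 2]) == ([4, 4, 0, 0], 8)
```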

First seen: 2025-12-31 16:08

Last seen: 2026-01-01 10:10