Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and has gotten much worse in the last few months.

I run Weird Gloop, which hosts some of the biggest video game wikis ever, like Minecraft, OSRS and League. Over the last 3 years, we’ve had to spend more and more of our time fighting with this bot traffic that is spiky, disproportionately expensive, and getting harder to distinguish from humans. If we weren’t constantly mitigating the bots, they would use ~10x more of our compute resources than everything else put together - even though that “everything else” includes tens of millions of (human) pageviews and tens of thousands of edits a day.

Everyone who runs wikis is dealing with the exact same problem. The Wikimedia Foundation has a post about it impacting operations, every major wiki farm has had varying degrees of service outages, and some smaller independent wikis have been knocked completely offline. Overall, I’d guess that about 95% of all server issues in the wiki ecosystem this year have been caused by bad scrapers.

Every wiki sysadmin I’ve talked to is dealing with these specific problems:

The scrapers are pretending to be human visitors, and getting pretty good at it

Most of the discussion I’ve seen about scrapers has focused on bots operated by the major AI companies (GPTBot, ClaudeBot, PerplexityBot, etc). Although these “official” bots have at times struggled to respect robots.txt, at least they usually properly identify themselves as bots in their User Agent string, which makes it really easy for a website operator to block them with Cloudflare, nginx, or any number of other techniques.

The problem is that when webmasters started blocking AI scrapers based on User Agent, it created a massive incentive for bots to pretend to be human traffic, so as to avoid getting blocked.
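To make that concrete, here is a minimal Python sketch of User Agent blocking and of why it stops working once a scraper lies about who it is. This is an illustration only, not how we or anyone else actually filters traffic: the `is_blocked` helper, the `BLOCKED_UA_TOKENS` list, and both example User Agent strings are made up for this example, and a real setup would normally do the equivalent check in Cloudflare or nginx rather than in application code.

```python
# Hypothetical blocklist for illustration; the tokens are the bot names mentioned above.
BLOCKED_UA_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains a blocked bot token (case-insensitive)."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_UA_TOKENS)


# Illustrative strings, not captured traffic.
# A bot that identifies itself is caught by the filter.
honest_bot = "Mozilla/5.0 (compatible; GPTBot/1.0)"
# A scraper sending a browser-like string is not, because nothing in it matches.
spoofed_bot = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

print(is_blocked(honest_bot))
print(is_blocked(spoofed_bot))
```

The filter only ever sees what the client chooses to send, so it works exactly as long as the bot is honest.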
Thi...