Hey all! I'm Jim, and I do system-y things at Bluesky. I'm here to give you some details about what happened on Monday of this week that caused Bluesky to go down intermittently for ~1/2 our users for about 8 hours.

First, I'd like to apologize to our users for the interruption in service. This is easily the worst outage we've seen in my time here. It's just not acceptable.

The Problem

The issue actually started earlier that weekend. Here's the Bluesky AppView's requests chart for the days leading up to the really bad day (Monday):

The yellow/green isn't important, but those dips are super nasty! They represent real user-facing downtime. Ouch!

We got a page on Saturday April 4. I took a look, thinking it was likely a transit issue. We have pretty extensive network monitoring, and it all looked clear.

I did, however, notice a spike in log lines like this in our AppView data backend (called the "data plane"):

{ "time": "2026-04-03T22:16:07.944910324Z", "level": "ERROR", "msg": "failed to set post cache item", "uri": "at://did:plc:mhvcx2z27zq2jtb3i7f5beb7/app.bsky.feed.post/3mim4uloar22m", "error": "dial tcp 127.32.0.1:0->127.0.0.1:11211: bind: address already in use" }

The timing of these log spikes lined up with drops in user-facing traffic, which makes sense. Our data plane heavily uses memcached to keep load off our main Scylla database, and if we're exhausting ports, that's a huge problem.

The Root Cause

It took a long time to find the actual issue due to subpar observability. We generally have excellent monitoring in the data plane, but it does assume that each request to it is small and doesn't do much work.

This particular RPC (GetPostRecord) takes a batch of post URIs, and looks them all up in memcached, then scylla upon cache miss. What I had missed is that we deployed a new internal service last week that sent less than three GetPostRecord requests per second, but it did sometimes send batches of 15-20 thousand URIs at a time.
Typically, we'd probably be doing between ...
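
To make the batch-size problem concrete, here is a minimal sketch of one generic way to keep a huge batch from fanning out into thousands of simultaneous cache lookups: split the batch into chunks and cap how many chunks are in flight. This is an illustration only, not the fix Bluesky shipped, and it rests on an assumption the text above does not confirm, namely that unbounded per-batch fan-out is what eats the ports. The names `chunked`, `bounded_batch_lookup`, `fetch_chunk`, `chunk_size`, and `max_in_flight` are all hypothetical; `fetch_chunk` stands in for whatever multi-key cache read the real service uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bounded_batch_lookup(
    uris: Sequence[str],
    fetch_chunk: Callable[[Sequence[str]], Dict[str, bytes]],
    chunk_size: int = 500,      # hypothetical value, tune for the real cache
    max_in_flight: int = 4,     # hypothetical cap on concurrent chunk lookups
) -> Dict[str, bytes]:
    """Look up a large batch of URIs without unbounded fan-out.

    `fetch_chunk` is a stand-in (an assumption, not a real Bluesky API) for
    a multi-key cache read that returns the entries it found.
    """
    results: Dict[str, bytes] = {}
    # The pool size is the concurrency limit: no more than `max_in_flight`
    # chunk lookups run at once, however large the incoming batch is.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for found in pool.map(fetch_chunk, chunked(uris, chunk_size)):
            results.update(found)
    return results
```

With these made-up defaults, a 20,000-URI batch becomes 40 chunks of 500, and at most 4 of them talk to the cache at any moment, so the work per request stays bounded even when the batch does not.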