How our protagonist discovered that a key service powering our support was absurdly vulnerable to overload, and what we did to fix it.

Part of our support infrastructure at work is an in-memory datastore that allows us to query our outstanding support work over various dimensions, such as work type, whether it's been put on hold for some reason, etc. It's functionally equivalent to a single table in an SQL database, where you have a single dataset, boolean filters and configurable sorting. It's kind of analogous to having bitmap filters with post-hoc filtering, so any use of sort/limit will sort the entire result set. And the key part here is that the result sets can be large enough that sorts can take one or two seconds.

For a bit of context, this service deployment wasn't autoscaled at the time, and upstream services will retry failed requests, sometimes after a relatively short timeout. Which is fun.

So, one day, this service had more query load than it could handle, and because of the inelasticity, it got overloaded, and queries started to take way longer (up to a minute, versus a typical 1-2s). Unfortunately, because this was an incident, and sometimes the panic sets in, one of my theories was that memory had gotten slower. Which of course was absurd, but under time pressure, incident brain can be very real. However, as foreshadowed earlier, this service had simply become overloaded: we not only had slightly higher than average demand, but also failure demand from retries.

Most of the time in a Go service, we pass around a context, so that when the caller gives up on us, we can cancel the operation, short-circuit and bail early. However, when we were able to get a CPU profile and take a look, the vast majority of the CPU time was taken up in the sort phase of the query.
In Go, none of the sort functions support cancellation (reasonably so, as normall...