The fix for a segfault that never shipped

Summary

At Recall.ai, we run an unusual workload. We record millions of hours of meetings every month. Each of these meetings generates a large amount of audio and video that we need to reliably capture and analyze. The audio we capture comes in a variety of shapes and sizes: an assortment of codecs, channels, sample rates, interleaving, and error-correction schemes. We normalize all of those into one consistent format that is universally playable.

We launch 18 million EC2 instances every month. Each instance is a “meeting bot” that joins a video call and captures the data in real time. Extremely rarely, in about 1 in 36 million bots, a bot would abruptly crash deep in the library code of our media pipeline. Unlike most web servers, a meeting bot instance is extremely stateful and very hard to replace, so a fatal error like this means the data is irrecoverably lost, forever. Even a one-in-tens-of-millions failure rate was unacceptable.

This is the story of how we tracked down the rare bug, went to significant effort to reproduce it, identified the root cause, and fixed it.

TL;DR

We encountered an extremely rare segfault in the AAC encoder we were using and root-caused it to a bug in its fixed-point math C code. We found that the bug had been patched over a decade ago, but the fix was never shipped to downstream consumers. Rather than fixing the bug, we replaced the library with a modern AAC encoder that did not experience these crashes.

Capturing the elusive core dump

This crash was so rare that reproducing it locally wasn’t feasible. Instead, we opted to capture the program state in production on the rare occasions the crash happened.

Our first clue was the process's exit code, 139 (139 = 128 + 11). By shell convention, a process killed by a signal is reported with an exit code of 128 plus the signal number. Signal 11 is SIGSEGV, a segmentation fault (segfault for short), which occurs when a program tries to access memory it is not supposed to. This mistake is serious enough that execution of the program halts immediately.

So how do we determine the cause of a SIGSEGV?
The answer is often a core dum...
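
The exit-code arithmetic described above (139 = 128 + 11) can be sketched in a few lines. This is only an illustration, not code from the post: `describe_exit_code` is a hypothetical helper, and it assumes the shell convention that exit codes above 128 mean "terminated by signal (code - 128)" and that signal numbers match the local platform (11 is SIGSEGV on Linux).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit code (hypothetical helper for illustration)."""
    if code > 128:
        # Assumption: the shell convention of 128 + signal number applies.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"exit code {code} (unknown signal {signum})"
        return f"likely killed by {name} (signal {signum})"
    return f"exited with status {code}"


print(describe_exit_code(139))
```

A program can also deliberately exit with a status above 128, so the result is a strong hint rather than proof; that is one reason a core dump is the more reliable source of truth.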

First seen: 2026-01-24 14:51

Last seen: 2026-01-24 14:51