OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)

https://news.ycombinator.com/rss Hits: 4
Summary

Frontier AI models have become excellent at writing functions, but can they actually debug production systems? To fix outages, you first need to see what's happening. In a microservices world, this means producing structured events that track a single request as it hops from service to service.

We asked 14 models to add distributed traces to existing codebases, using the standard method: OpenTelemetry instrumentation. We picked tasks that would be easy for a Site Reliability Engineer (SRE). Go to the OTelBench website for complete charts.

All models struggle with OpenTelemetry. Even the best model, Claude 4.5 Opus, succeeded only 29% of the time, and GPT 5.2 was similar at 26%. Surprisingly, Gemini 3 Pro has no edge over Gemini 3 Flash, which scored 19%.

We are releasing OTelBench as an open-source benchmark, with all tasks in QuesmaOrg/otel-bench. We use the Harbor framework (by the creators of TerminalBench), so you can easily run it yourself to reproduce results, test new models, or create benchmarks for your own use cases (we welcome contributions!).

Background: What is distributed tracing?

When an app runs on a single machine, you can often trace an error by scrolling through a log file. But when it runs across 50 microservices, a single request gets scattered into a chaotic firehose of disconnected events. Distributed tracing solves this by linking those events back together, allowing you to follow a user action, like clicking Login, as it jumps from the API Gateway, to the Auth Service, to the Database, and back.

[Figure: trace waterfall for a user Login request passing through API Gateway, Auth, Database, and User Svc, TraceID: abc123def456. Caption: Distributed tracing links a user action, like a Login button click, to every underlying microservice call.]

To make this work, you need instrumentation. This is code that you add to your app to:

1. Start a trace when a request comes in.
2. Pass the TraceID (context) when your app calls another service.
3. Send the data to a backend so you can see the graph.
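The three instrumentation steps above can be sketched in a few lines of plain Python. This is a toy illustration only, not the OpenTelemetry API and not code from the benchmark: the names Span, start_span, inject, finish, EXPORTED, api_gateway, and auth_service are all hypothetical, the "backend" is an in-memory list, and the services are ordinary function calls rather than network hops. The one real-world detail is the header layout, which follows the W3C traceparent format (version, trace ID, parent span ID, flags).

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed operation inside a trace (hypothetical toy type)."""
    trace_id: str              # shared by every span of the same request
    span_id: str               # unique to this operation
    parent_id: Optional[str]   # span_id of the caller, None for the root
    name: str
    start: float = field(default_factory=time.time)
    end: Optional[float] = None


# Stand-in for a tracing backend: finished spans are appended here.
EXPORTED: List[Span] = []


def start_span(name: str, headers: Dict[str, str]) -> Span:
    """Step 1: continue the caller's trace if context arrived, else start one."""
    parent = headers.get("traceparent")
    if parent:
        _version, trace_id, parent_id, _flags = parent.split("-")
    else:
        trace_id, parent_id = secrets.token_hex(16), None
    return Span(trace_id, secrets.token_hex(8), parent_id, name)


def inject(span: Span) -> Dict[str, str]:
    """Step 2: build outgoing headers that carry the trace context."""
    return {"traceparent": f"00-{span.trace_id}-{span.span_id}-01"}


def finish(span: Span) -> None:
    """Step 3: close the span and hand it to the (in-memory) backend."""
    span.end = time.time()
    EXPORTED.append(span)


def auth_service(headers: Dict[str, str]) -> None:
    span = start_span("auth.check", headers)
    finish(span)


def api_gateway(headers: Dict[str, str]) -> None:
    span = start_span("gateway.login", headers)
    auth_service(inject(span))  # the downstream call carries the context
    finish(span)


api_gateway({})  # an incoming request with no trace context yet
for s in EXPORTED:
    print(s.name, s.trace_id, s.span_id, s.parent_id)
```

Both printed spans share one trace ID, and the auth span's parent ID equals the gateway span's ID. That parent link is what lets a backend draw the waterfall; if a service forgets to forward the header, its spans start a fresh trace and the request appears as disconnected fragments.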
OpenTelemetry (OTel) is the i...

First seen: 2026-01-29 16:33

Last seen: 2026-01-29 19:34