Computational complexity of schema-guided document extraction

https://news.ycombinator.com/rss Hits: 2
Summary

When we started building Pulse, we assumed that structured outputs had solved the document extraction problem. Define a JSON schema, point an LLM at your documents, and get clean data back. Every major API provider now supports it. OpenAI, Anthropic, Google, and a growing ecosystem of open source tools like Outlines and XGrammar have made constrained generation accessible to anyone with an API key.So our research team internally ran some evals - used a relatively simple document layout with bank statements and invoices, defined a schema, and watched it work. Then we tried it on ten thousand more complex documents with alternate currency markers, faxed documents with unwired tables, nested line items, conditional fields, and variable-length arrays. That's when things got interesting.The short version: enforcing structured outputs during generation is not free, and for the kinds of complex, nested schemas that real document extraction requires, naive approaches either blow up in compute or silently degrade extraction quality.How Constrained Decoding Actually WorksAt each generation step, the model produces a probability distribution over its entire vocabulary (128,000+ tokens in OSS models like Llama). Constrained decoding works by masking out tokens that would violate the schema, setting their probabilities to zero before sampling. The model can only generate tokens that keep the output on a valid path through the grammar.For simple constraints like regular expressions, this is efficient. The Outlines library showed you can precompute a finite state machine (FSM) from a regex and use it to generate token masks in O(1) time per step. For regular languages, the set of valid next tokens depends only on the current FSM state, not on the history of how you got there.While a specific JSON schema can be mapped to a regular language, the recursive nature of nested objects and arrays makes it behave more like a context-free grammar in terms of parser complexity.The State Expl...

First seen: 2026-01-12 18:02

Last seen: 2026-01-12 19:02