Part 1: Designing the architecture

Part 1 of two. Part 2 covers heartbeats, cancellation, approval policies, observability, error handling, and developer isolation.

What we’re building

Our AI agents need access to customer documentation, which can live in Confluence, SharePoint, Google Drive, Salesforce Knowledge, or any of 20+ other platforms. Getting that documentation into a searchable state means crawling, extracting, chunking, embedding, and storing it across Supabase, TurboPuffer, and Elasticsearch. For small sources, a simple batch job would work. For large ones, with hundreds of thousands of documents and multi-hour processing times, we needed something more resilient.

We built the ingestion pipeline on Temporal, and this post walks through the architecture.

The problem

The range of source sizes is wild. A small sitemap is dozens of pages. A large customer’s knowledge base can be hundreds of thousands of documents. A single electronics datasheet might run to thousands of pages of specs, compliance data, and circuit diagrams.

The pipeline looks simple on paper:

Crawl source → Download → Extract text → Chunk → Embed → Store in DB(s)

In practice, it needs to be:

Durable: runs can take hours. A crash at hour four shouldn’t restart from scratch.
Stateful: you need to track which documents succeeded, failed, or got skipped, and where you left off.
Concurrency-controlled: downstream APIs have rate limits. Unbounded fan-out makes things slower, not faster.
Observable: when you’re processing thousands of documents across distributed workers, you need to trace failures back to individual documents.
Approval-gated: before freshly ingested data goes live, someone should review what was indexed.

We evaluated a few orchestration options and picked Temporal. The bake-off is a different post.
This one is about the architecture patterns that made Temporal work at scale, and the design goal behind them: a 200K-document run should use the same orchestration model as a 2K-docum...
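
To make the "concurrency-controlled" and "stateful" requirements concrete, here is a minimal sketch of bounded fan-out with per-document outcome tracking. It uses plain Python asyncio rather than Temporal, so it gives none of the durability discussed above and is not the pipeline's actual implementation. The names `process_document`, `ingest`, `RunReport`, and the default limit of 4 are all hypothetical, and the failure rule is simulated only to show how errors are attributed to individual documents.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class RunReport:
    """Per-run bookkeeping: which documents succeeded or failed."""
    succeeded: list[str] = field(default_factory=list)
    failed: dict[str, str] = field(default_factory=dict)  # doc id -> error message
    peak_in_flight: int = 0


async def process_document(doc_id: str) -> None:
    """Hypothetical stand-in for download -> extract -> chunk -> embed -> store."""
    await asyncio.sleep(0.01)  # placeholder for real network and CPU work
    if doc_id.endswith("7"):  # simulated failure, for illustration only
        raise ValueError("simulated extraction failure")


async def ingest(doc_ids: list[str], max_in_flight: int = 4) -> RunReport:
    """Process every document, never running more than max_in_flight at once."""
    report = RunReport()
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0

    async def guarded(doc_id: str) -> None:
        nonlocal in_flight
        async with semaphore:  # blocks while max_in_flight tasks are running
            in_flight += 1
            report.peak_in_flight = max(report.peak_in_flight, in_flight)
            try:
                await process_document(doc_id)
                report.succeeded.append(doc_id)
            except Exception as exc:
                # One bad document is recorded and does not abort the run.
                report.failed[doc_id] = str(exc)
            finally:
                in_flight -= 1

    await asyncio.gather(*(guarded(doc_id) for doc_id in doc_ids))
    return report


if __name__ == "__main__":
    report = asyncio.run(ingest([f"doc-{i}" for i in range(20)]))
    print(len(report.succeeded), "succeeded;", len(report.failed), "failed")
```

The semaphore caps in-flight work regardless of how many documents are queued, which is the property that keeps a large run from overwhelming rate-limited downstream APIs. What this sketch cannot do is survive a process crash, which is the gap a durable orchestrator is meant to fill.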