An interesting file system feature that I came across a few years ago is sparse files. In short, many file systems allow you to create a logical file with "empty" (fully zeroed) blocks that are not physically backed until they get written to. Partially copying from ArchWiki, the behavior looks like this: The file starts at 0 physical bytes on disk despite being logically 512MB. Then, after writing some non-zero bytes at a 16MB offset, it physically allocates a single block (4KB). The file system is maintaining metadata on which blocks of the file are physically represented on disk and which ones are not. To normal readers of the file, it's transparent -- the sparsity is managed completely by the file system.

At Amplitude, we found a cool use case for sparse files. All of the data is stored durably in cold storage (Amazon S3) in a columnar data format used for analytics queries. But it's inefficient and costly to fetch it from S3 every time, so the data gets cached on local NVMe SSDs (e.g. from the r7gd instance class). These local SSDs are more than ten times as expensive as cold storage, though, so you need a good strategy for deciding how and what to cache. To understand why sparse files are a good option for this, let's revisit columnar data formats briefly.

One of the observations about analytics queries that makes columnar data formats so effective is the fact that usually only a very small subset of columns (5-10 out of potentially thousands) are used on any particular query (and, to some extent, even across many queries). The data being stored in a columnar fashion means that each of these columns is a contiguous range inside the file, making it much faster to read.
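As a concrete illustration of the sparse-file behavior described at the start of this post, here is a minimal Python sketch that creates a logically 512MB file, writes 4KB of non-zero bytes at a 16MB offset, and compares logical and physical sizes. It is not the ArchWiki example itself. The function name `sparse_file_demo` and the file name `sparse.img` are made up for illustration, and the sketch assumes a Unix-like system where `st_blocks` is reported in 512-byte units (true on Linux, not guaranteed everywhere).

```python
import os
import tempfile

MB = 1024 * 1024
BLOCK = 4096  # assumed block size, matching the 4KB figure in the text


def sparse_file_demo(directory):
    """Create a sparse file and report its logical vs. physical size."""
    path = os.path.join(directory, "sparse.img")  # hypothetical file name

    # Extend the logical size to 512MB without writing any data.
    with open(path, "wb") as f:
        f.truncate(512 * MB)
    before = os.stat(path)

    # Write one block of non-zero bytes at a 16MB offset.
    with open(path, "r+b") as f:
        f.seek(16 * MB)
        f.write(b"\x01" * BLOCK)
        f.flush()
        os.fsync(f.fileno())
    after = os.stat(path)

    # Readers see zeros in the unwritten region and the data where it was written.
    with open(path, "rb") as f:
        hole = f.read(BLOCK)
        f.seek(16 * MB)
        data = f.read(BLOCK)

    return {
        "logical_bytes": after.st_size,
        # st_blocks counts 512-byte units on Linux (an assumption elsewhere).
        "physical_before": before.st_blocks * 512,
        "physical_after": after.st_blocks * 512,
        "hole_is_zero": hole == bytes(BLOCK),
        "data_round_trips": data == b"\x01" * BLOCK,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for key, value in sparse_file_demo(d).items():
            print(f"{key}: {value}")
```

On a file system that supports sparse files, `physical_before` should be at or near zero and `physical_after` should grow by roughly one block, while `logical_bytes` stays at 512MB throughout. The exact physical numbers depend on the file system and its allocation granularity.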
The contiguous ranges are also important in the context of our local caching -- we're not trying to pick out small pieces scattered across the file.

Setting sparse files aside for a moment, we originally had two different approaches for doing this caching:

The most naive strategy is caching entire fil...