Content-Defined Chunking, Part 3
February 16, 2026 | 19 minutes (3534 words)

In Part 1, we explored why content-defined chunking exists and surveyed three algorithm families. In Part 2, we took a deep dive into FastCDC’s GEAR hash, normalized chunking, and how average byte targets affect chunk distribution. In this post, we bring the pieces together to see deduplication in action, examine where CDC is used in practice today and where it is not, and explore the cost tradeoffs that shape real-world systems.
Deduplication in Action
Imagine you are building a system to store files that change over time. Each new version is mostly the same as the last, with only a small edit here or there. As we saw in Part 1, storing a complete copy of every version wastes storage on identical content. Content-defined chunking offers a way out: split each version into chunks based on content, fingerprint each chunk, and only store chunks you have not seen before. But what does that actually look like in practice? The explorer below runs FastCDC on editable text. A small edit has already been saved to show deduplication at work: most chunks are recognized as duplicates and never stored again. Try making your own edits to see how content-defined boundaries respond.
Click "Save Version" after editing to see which chunks are new and which are shared. Hover over chunks to highlight them across views.
As the demo shows, even a small edit produces only a handful of new chunks, while the rest are shared across versions. But how does this work under the hood?
The Deduplication Pipeline
Recall the system to store files that change over time, where the goal is to avoid writing identical content twice. Implementations vary widely, but any CDC-based deduplication system needs the same core ingredients: a way to split data into chunks, a way to fingerprint each chunk, and a way to check whether that fingerprint has been seen before. The visualization below walks through these ingredients as a simple linear pipeline, though real systems will likely optimize by reordering, parallelizing, or batching these steps.
Walking through the stages: the raw file bytes enter the pipeline and are split into variable-size chunks by the CDC algorithm (here, FastCDC). Each chunk is then fingerprinted with a cryptographic hash like BLAKE3. The system looks up each fingerprint in the chunk store. Chunks that already exist are skipped with only a reference recorded, while genuinely new chunks are written to storage and their fingerprints registered for future lookups. The result is that storing a new version of a file costs only the new chunks, not a full copy.
Each stage in the pipeline maps to just a few lines of code, but together they form a system where redundant data is identified and eliminated before it ever reaches disk or network. When a file changes, only the chunks that were actually modified produce new hashes. The rest match what is already in the store, so they are never written again.
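To make that concrete, here is a minimal sketch of the whole pipeline in Python. It uses a toy Gear-style rolling hash with illustrative parameters (a 10-bit mask for roughly 1 KiB average chunks), SHA-256 for fingerprints, and a plain dict as the chunk store; a real system would use FastCDC with normalized chunking, BLAKE3, and a persistent index, so treat every name and constant here as an assumption for illustration.

```python
import hashlib
import random

# Toy Gear-style table: 256 pseudo-random 64-bit values, one per byte value.
rng = random.Random(42)
GEAR = [rng.getrandbits(64) for _ in range(256)]

MASK = (1 << 10) - 1            # low 10 bits zero -> ~1 KiB average chunks
MIN_SIZE, MAX_SIZE = 256, 4096  # illustrative bounds, far below production sizes

def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries (simplified Gear chunking)."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & ((1 << 64) - 1)  # rolling hash: shift + lookup
        size = i - start + 1
        if size >= MAX_SIZE or (size >= MIN_SIZE and (h & MASK) == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

store: dict[str, bytes] = {}    # fingerprint -> chunk bytes (the chunk store)

def save_version(data: bytes) -> tuple[list[str], int]:
    """Store one file version; return its manifest and bytes actually written."""
    manifest, written = [], 0
    for c in chunk(data):
        fp = hashlib.sha256(c).hexdigest()  # cryptographic fingerprint
        if fp not in store:                 # index lookup: seen before?
            store[fp] = c                   # new chunk: write and register it
            written += len(c)
        manifest.append(fp)                 # either way, record a reference
    return manifest, written

v1 = bytes(rng.getrandbits(8) for _ in range(50_000))
v2 = v1[:20_000] + b"EDIT" + v1[20_000:]    # small insertion mid-file

m1, w1 = save_version(v1)
m2, w2 = save_version(v2)
print(f"v1 wrote {w1} bytes, v2 wrote {w2} bytes")
```

Saving the first version writes every byte; saving the edited version writes only the chunk or two around the insertion, because the content-defined boundaries downstream of the edit resynchronize and every later fingerprint is already in the store.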
The Core Cost Tradeoffs
Deduplication is not free. Every stage of the pipeline above consumes resources, and the central engineering challenge is deciding where to spend and where to save.[15] The core CDC costs fall into four categories, and they all interact.
Hashing, compression, and chunking
CPU is the first cost you pay, and it shows up in three places: the rolling hash that finds chunk boundaries, the cryptographic hash that fingerprints each chunk, and the compression that shrinks chunks before storage.
The rolling hash itself is cheap: as we saw in Part 2, Gear hash processes each byte with just a shift and a table lookup. The cryptographic hash that follows is the primary CPU bottleneck. SHA-256 and BLAKE3 must process every byte of every chunk to produce a collision-resistant fingerprint.1 With fast chunking algorithms like FastCDC, fingerprinting dominates the CPU profile of the pipeline.[17] Stronger hashes cost more cycles but reduce the probability of two different chunks sharing the same hash to effectively zero.
Then there is compression: most production systems (Restic, Borg, and others) compress each chunk before storing it, typically with zstd or LZ4. Compression adds meaningful CPU cost on writes and a smaller cost on reads (decompression), but it can dramatically reduce the bytes that actually hit disk and network.
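The write path is "compress, then store"; the read path is "fetch, then decompress." A small sketch, using zlib only because it ships with Python's standard library (Restic and Borg would reach for zstd or LZ4 instead):

```python
import zlib

# zlib stands in for zstd/LZ4 here; the shape of the path is the same.
chunk = b"All work and no play makes Jack a dull boy. " * 200

stored = zlib.compress(chunk, level=6)   # write path: compress before storing
restored = zlib.decompress(stored)       # read path: decompress after fetching

assert restored == chunk                 # lossless round trip
print(f"{len(chunk)} bytes in memory -> {len(stored)} bytes on disk")
```

Highly repetitive chunks like this one shrink dramatically; real-world ratios depend entirely on the data, which is why compressors expose tunable levels.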
All three costs scale linearly with data volume. In practice, BLAKE3 is fast enough that hashing rarely bottlenecks a modern pipeline, and modern compressors like zstd offer tunable speed-vs-ratio tradeoffs, but both represent real work on every byte that enters the system. Systems whose chunks have predictable internal structure can push further: Meta's OpenZL generates compressors tailored to a specific data format, achieving better compression ratios at higher speeds than general-purpose tools can manage.[22]
Chunk index lookups at scale
Memory is where the chunk index lives. The content-addressable store needs a searchable mapping from hash to storage location, and that index must be fast to query because every incoming chunk triggers a lookup. At scale, keeping a full chunk index in RAM becomes impractical, and a disk-based index with one seek per incoming chunk is far too slow.[16][18]
The primary cost driver is chunk count. The index size scales with the number of unique chunks, not with total data volume, which is good. But smaller average chunk sizes mean more chunks per file, which means a larger index. A system with 4 KB average chunks will produce roughly four times as many index entries as one with 16 KB chunks for the same data.
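A back-of-envelope calculation shows how quickly this adds up. The 48-byte entry size below is an assumption (a 32-byte hash plus roughly 16 bytes of location metadata); real entries vary by implementation, but the scaling with chunk count does not:

```python
# Assumed entry: 32-byte hash + ~16 bytes of location metadata (illustrative).
ENTRY_BYTES = 32 + 16

def index_bytes(total_bytes: int, avg_chunk_bytes: int) -> int:
    """Approximate index size if every chunk of the data were unique."""
    return (total_bytes // avg_chunk_bytes) * ENTRY_BYTES

TiB = 1 << 40
print(index_bytes(TiB, 4 * 1024) // (1 << 30), "GiB")   # 4 KiB chunks: 12 GiB
print(index_bytes(TiB, 16 * 1024) // (1 << 30), "GiB")  # 16 KiB chunks: 3 GiB
```

Under these assumptions, indexing 1 TiB at 4 KiB average chunks needs on the order of 12 GiB of index, four times the 3 GiB needed at 16 KiB chunks, which is exactly why the index outgrows RAM long before the data outgrows the disk.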
Once the index outgrows a single machine, or needs to be shared across a fleet, it becomes a distributed systems problem: you need a persistent, highly available data store (typically a database or distributed key-value system) to hold the mapping and serve lookups at low latency. That infrastructure has its own operational cost, and it scales with chunk count.
Transfer efficiency through deduplication
Network is often where deduplication pays for itself most visibly. In distributed systems (backup to a remote server, syncing across devices), only new chunks need to traverse the wire. The primary cost driver is the dedup ratio: the fraction of chunks that already exist at the destination and never need to be sent.
The Low-Bandwidth Network File System (LBFS) demonstrated this early on, achieving over an order of magnitude less bandwidth than traditional network file systems by transmitting only chunks not already present at the receiver.[19] If you edit a paragraph in a 10 MB document and the system produces 200 chunks, perhaps only 3 of those are new. That is a transfer of kilobytes instead of megabytes.
Smaller chunks generally improve this ratio because edits are less likely to span an entire small chunk, but each chunk also carries metadata overhead (its hash, its length, its position in the manifest), so there is a point of diminishing returns.
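That point of diminishing returns is easy to see with a hypothetical sync-cost model: suppose the sender ships a manifest of 32-byte fingerprints plus every chunk the receiver is missing, and a small edit dirties roughly one chunk. All the numbers below are illustrative, not from any real system:

```python
# Hypothetical model: bytes on the wire = full fingerprint manifest + one
# modified chunk. Illustrative numbers only.
HASH_BYTES = 32
FILE = 10 * 1024 * 1024        # a 10 MiB file with one small edit

for avg in (1024, 4096, 16384, 65536):
    n_chunks = FILE // avg
    manifest = n_chunks * HASH_BYTES   # metadata: grows as chunks shrink
    payload = avg                      # data: ~one modified chunk to resend
    print(f"{avg:6d} B chunks -> {manifest + payload:7d} bytes on the wire")
```

In this toy model the total is minimized at a mid-sized chunk: very small chunks drown the savings in manifest overhead, while very large chunks resend far more data than the edit touched.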
Unique chunks plus metadata overhead
Storage (disk or object store) holds the unique chunks plus all the metadata that lets you reconstruct files from them: hashes, chunk-to-file mappings, version manifests. The primary cost driver is the balance between dedup savings and metadata overhead. Smaller chunks improve deduplication (more sharing opportunities), but they also increase the metadata-to-data ratio.[21] On cloud object stores, chunk count also drives API operations costs: every PUT and GET is priced per request, so more chunks means more billable operations (Part 4 explores how containers address this).
At extremely small chunk sizes (say, 256 bytes), the overhead of storing a 32-byte hash and associated bookkeeping for each chunk becomes a significant fraction of the chunk itself.
Meyer and Bolosky found that for live desktop file systems, where most duplication consists of identical files stored in multiple locations, whole-file deduplication already captures roughly 75% of the savings of fine-grained block-level dedup.[20] But that result is workload-dependent. When files churn frequently and edits are localized within larger files (the pattern that dominates backup, sync, and software distribution), whole-file dedup sees zero savings on each modified file while CDC captures nearly everything. The value of sub-file chunking scales with both how much duplicated content exists and how frequently that content changes.
The explorer below visualizes these four dimensions as you move the chunk size slider. Small chunks push CPU, memory, and metadata overhead up while improving deduplication and network efficiency. Large chunks do the reverse. The sweet spot depends on your workload.
If you experimented with the Parametric Chunking Explorer in Part 2, you saw this tradeoff firsthand: smaller average sizes produced more chunks with tighter size distributions, while larger averages produced fewer, more variable chunks. Those demos showed the statistical effect. In production, the right balance depends on your workload: the volume of duplicated content, the rate at which that content changes, and how each cost dimension (CPU, memory, network, storage) maps onto your constraints. These are the competing forces that determine whether CDC is a valuable strategy for your system, and if so, what average chunk size best balances them. That answer depends on your domain, and only you as the expert in your particular system can make that call. My hope is that the intuitions developed here help you make a more informed decision.
Where CDC Lives Today
Content-defined chunking has become infrastructure, often invisible but always essential. It shows up across three broad categories: backup and archival, file sync and distribution, and content-addressable storage.
Backup and archival tools were the earliest adopters. Restic uses Rabin fingerprints with configurable chunk sizes,[29] Borg uses Buzhash with a secret seed (preventing attackers from predicting chunk boundaries based on known content),[30] and newer tools like Kopia,[31] Duplicacy,[32] Bupstash,[33] and Tarsnap[34] all rely on CDC to deduplicate across snapshots. The pattern is the same in each: split data into content-defined chunks, fingerprint each chunk, and store only the unique ones.
File sync and software distribution use CDC to minimize transfer sizes. Riot Games rebuilt the League of Legends patcher around FastCDC, replacing an older binary-delta system and achieving a tenfold improvement in patching speeds.[27] casync, created by Lennart Poettering, applies CDC to OS and container image distribution, chunking across file boundaries so that updates to a filesystem image only transfer the chunks that actually changed.[35]
Content-addressable storage systems like IPFS use CDC to split files into variable-size blocks before distributing them across a peer-to-peer network.[28] Because chunk boundaries are determined by content rather than position, identical regions of different files naturally converge on the same chunks and the same content addresses.
When CDC Is Not the Right Choice
Not every system chooses CDC, and the cost tradeoffs help explain why. CDC optimizes for one thing above all: stable chunk boundaries across edits. That stability enables fine-grained deduplication, but it comes at a cost, and not every application prioritizes deduplication over other concerns.
Dropbox is the most prominent example. Their architecture uses fixed-size 4 MiB blocks with SHA-256 hashing, and has since the early days of the product.[23] Dropbox’s primary engineering challenge was not deduplication but transport: syncing files across hundreds of millions of devices as fast as possible while keeping infrastructure costs predictable.
Fixed-size blocks give Dropbox properties that CDC cannot. Block N always starts at offset N * 4 MiB, so a client can request any block without first receiving a boundary list. Upload work can be split across threads by byte offset with zero coordination, because boundaries are known before the content is read. The receiver knows when each block ends, enabling Dropbox’s streaming sync architecture where downloads begin before the upload finishes, achieving up to 2x improvement on large file sync.[23] And because every block is exactly 4 MiB (except the last), memory allocation, I/O scheduling, and storage alignment are all simple to model and predict at scale.
There is also the metadata question. CDC’s chunk index must be backed by a persistent, highly available data store once it outgrows a single machine. For Dropbox, serving hundreds of millions of users, the difference between a fixed-size block index and a variable-size CDC chunk index is not just memory; it is the size and complexity of the metadata infrastructure required to support it. Fixed-size blocks produce fewer, more predictable index entries, which simplifies that infrastructure considerably.
The tradeoff is real. The QuickSync study found that a minor edit in Dropbox can generate sync traffic 10x the size of the actual modification, because insertions shift every subsequent block boundary.[25] This is precisely the boundary-shift problem that CDC was designed to solve, as we explored in Part 1. But Dropbox chose to absorb that cost and compensate elsewhere: their Broccoli compression encoder achieves ~33% upload bandwidth savings,[24] and the streaming sync architecture pipelines work so effectively that the extra bytes matter less than they otherwise would.
In short, Dropbox traded storage efficiency for transport speed and operational simplicity. Fixed-size blocks mean a predictable, easily modeled object count, which is critical when your storage bill depends on API call volume. The ability to parallelize everything without content-dependent coordination was worth more than the deduplication gains CDC would have provided.
Seafile, an open-source file sync platform, takes the opposite approach: it uses Rabin fingerprint-based CDC with ~1 MB average chunks to achieve block-level deduplication across file versions and libraries.[26] Where Dropbox chose to optimize purely for transport, Seafile shows that CDC-based sync systems can work in practice. Part 4 explores how the container abstraction makes this economically viable.
Why Cloud Storage Is the Cost That Matters
The four dimensions above assume a local or self-managed storage backend where the cost of writing and reading an object is just disk I/O. Most production systems today do not work that way. They store chunks on cloud object storage (S3, GCS, or Azure Blob Storage), and cloud providers turn every one of those engineering costs into a line item on a bill.
Cloud providers charge not just per GB stored but also per API operation. Every PUT and every GET has a price. When each chunk is its own object, the number of API calls scales with the number of chunks, and that operations cost can dominate the bill entirely. The same knob that the explorer above illustrates (smaller chunks improve dedup but increase chunk count) takes on a new, financially painful dimension: more chunks means more API calls means a larger cloud bill, even if the total bytes stored are fewer.
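A rough model makes the effect visible. The prices below are hypothetical (loosely S3-shaped, but pick any provider's published rates and the shape holds): per-request charges dwarf per-byte charges once chunks get small.

```python
# Hypothetical prices, illustrative only -- not any provider's actual rates.
PUT_COST = 0.005 / 1000        # dollars per PUT request
STORAGE_COST = 0.023           # dollars per GB-month

def first_upload_cost(total_gb: float, avg_chunk_kb: float) -> float:
    """Month-one cost of uploading total_gb with one object per chunk."""
    n_chunks = total_gb * 1024 * 1024 / avg_chunk_kb
    return n_chunks * PUT_COST + total_gb * STORAGE_COST

for kb in (4, 64, 1024):
    print(f"{kb:5d} KiB chunks: ${first_upload_cost(100, kb):7.2f}")
```

At 4 KiB chunks, the request charges for uploading 100 GB are roughly fifty times the storage charge; at 1 MiB chunks the storage charge dominates again. Deduplication savings on stored bytes can easily be wiped out by the operations column of the bill.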
This is the problem that Part 4: CDC in the Cloud tackles head-on. Grouping chunks into larger, fixed-size containers collapses the object count and makes CDC viable on cloud storage. But containers introduce their own challenges: fragmentation, garbage collection complexity, and restore performance degradation. Part 5 then takes a deep dive into the full cost picture, exploring how different storage providers, caching layers, and container configurations combine to determine the real monthly bill.
- Collision resistance requires that it is computationally infeasible to find two different inputs that produce the same hash. For this guarantee to hold, every bit of the input must influence the output. If the function skipped even a single byte, two inputs differing only in that byte would hash identically, a trivial collision. This is the fundamental difference from rolling hashes used for boundary detection: Gear hash only looks at a sliding window and is not collision-resistant, which is fine for finding chunk boundaries but not for content addressing, where a collision means two different chunks are treated as identical and one gets silently discarded. BLAKE3 is notably faster than SHA-256 here because it uses a Merkle tree structure internally, allowing parts of the input to be hashed in parallel across cores and SIMD lanes, but it still processes every byte. ↩
References
Tools & Implementations
The interactive animations in this post are available for experimentation. Try modifying the input text, adjusting chunk size parameters, and watching how CDC adapts to your changes.