Content-Defined Chunking, Part 3
February 16, 2026 | 19 minutes (3534 words)

In Part 1, we explored why content-defined chunking exists and surveyed three algorithm families. In Part 2, we took a deep dive into FastCDC’s GEAR hash, normalized chunking, and how average byte targets affect chunk distribution. In this post, we bring the pieces together to see deduplication in action, examine where CDC is used in practice today and where it is not, and explore the cost tradeoffs that shape real-world systems.
Deduplication in Action
Imagine you are building a system to store files that change over time. Each new version is mostly the same as the last, with only a small edit here or there. As we saw in Part 1, storing a complete copy of every version wastes storage on identical content. Content-defined chunking offers a way out: split each version into chunks based on content, fingerprint each chunk, and only store chunks you have not seen before. But what does that actually look like in practice? The explorer below runs FastCDC on editable text. A small edit has already been saved to show deduplication at work: most chunks are recognized as duplicates and never stored again. Try making your own edits to see how content-defined boundaries respond.
Click "Save Version" after editing to see which chunks are new and which are shared. Hover over chunks to highlight them across views.
As the demo shows, even a small edit produces only a handful of new chunks, while the rest are shared across versions. But how does this work under the hood?
The Deduplication Pipeline
Recall the system to store files that change over time, where the goal is to avoid writing identical content twice. Implementations vary widely, but any CDC-based deduplication system needs the same core ingredients: a way to split data into chunks, a way to fingerprint each chunk, and a way to check whether that fingerprint has been seen before. The visualization below walks through these ingredients as a simple linear pipeline, though real systems will likely optimize by reordering, parallelizing, or batching these steps.
Walking through the stages: the raw file bytes enter the pipeline and are split into variable-size chunks by the CDC algorithm (here, FastCDC). Each chunk is then fingerprinted with a cryptographic hash like BLAKE3. The system looks up each fingerprint in the chunk store. Chunks that already exist are skipped with only a reference recorded, while genuinely new chunks are written to storage and their fingerprints registered for future lookups. The result is that storing a new version of a file costs only the new chunks, not a full copy.
Each stage in the pipeline maps to just a few lines of code, but together they form a system where redundant data is identified and eliminated before it ever reaches disk or network. When a file changes, only the chunks that were actually modified produce new hashes. The rest match what is already in the store, so they are never written again.
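To make that concrete, here is a minimal sketch of the whole pipeline in Python. It uses a toy Gear-style rolling hash with illustrative parameters (a 10-bit mask for roughly 1 KiB average chunks), SHA-256 for fingerprints, and a plain dict as the chunk store; a real system would use FastCDC with normalized chunking, BLAKE3, and a persistent index, so treat every name and constant here as an assumption for illustration.

```python
import hashlib
import random

# Toy Gear-style table: 256 pseudo-random 64-bit values, one per byte value.
rng = random.Random(42)
GEAR = [rng.getrandbits(64) for _ in range(256)]

MASK = (1 << 10) - 1            # low 10 bits zero -> ~1 KiB average chunks
MIN_SIZE, MAX_SIZE = 256, 4096  # illustrative bounds, far below production sizes

def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries (simplified Gear chunking)."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & ((1 << 64) - 1)  # rolling hash: shift + lookup
        size = i - start + 1
        if size >= MAX_SIZE or (size >= MIN_SIZE and (h & MASK) == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

store: dict[str, bytes] = {}    # fingerprint -> chunk bytes (the chunk store)

def save_version(data: bytes) -> tuple[list[str], int]:
    """Store one file version; return its manifest and bytes actually written."""
    manifest, written = [], 0
    for c in chunk(data):
        fp = hashlib.sha256(c).hexdigest()  # cryptographic fingerprint
        if fp not in store:                 # index lookup: seen before?
            store[fp] = c                   # new chunk: write and register it
            written += len(c)
        manifest.append(fp)                 # either way, record a reference
    return manifest, written

v1 = bytes(rng.getrandbits(8) for _ in range(50_000))
v2 = v1[:20_000] + b"EDIT" + v1[20_000:]    # small insertion mid-file

m1, w1 = save_version(v1)
m2, w2 = save_version(v2)
print(f"v1 wrote {w1} bytes, v2 wrote {w2} bytes")
```

Saving the first version writes every byte; saving the edited version writes only the chunk or two around the insertion, because the content-defined boundaries downstream of the edit resynchronize and every later fingerprint is already in the store.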
The Core Cost Tradeoffs
Deduplication is not free. Every stage of the pipeline above consumes resources, and the central engineering challenge is deciding where to spend and where to save.[15] The core CDC costs fall into four categories, and they all interact.
Hashing, compression, and chunking
CPU is the first cost you pay, and it shows up in three places: the rolling hash that finds chunk boundaries, the cryptographic hash that fingerprints each chunk, and the compression that shrinks chunks before storage.
The rolling hash itself is cheap: as we saw in Part 2, Gear hash processes each byte with just a shift and a table lookup. The cryptographic hash that follows is the primary CPU bottleneck. SHA-256 and BLAKE3 must process every byte of every chunk to produce a collision-resistant fingerprint.1 With fast chunking algorithms like FastCDC, fingerprinting dominates the CPU profile of the pipeline.[17] Stronger hashes cost more cycles but reduce the probability of two different chunks sharing the same hash to effectively zero.
Then there is compression: most production systems (Restic, Borg, and others) compress each chunk before storing it, typically with zstd or LZ4. Compression adds meaningful CPU cost on writes and a smaller cost on reads (decompression), but it can dramatically reduce the bytes that actually hit disk and network.
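The write path is "compress, then store"; the read path is "fetch, then decompress." A small sketch, using zlib only because it ships with Python's standard library (Restic and Borg would reach for zstd or LZ4 instead):

```python
import zlib

# zlib stands in for zstd/LZ4 here; the shape of the path is the same.
chunk = b"All work and no play makes Jack a dull boy. " * 200

stored = zlib.compress(chunk, level=6)   # write path: compress before storing
restored = zlib.decompress(stored)       # read path: decompress after fetching

assert restored == chunk                 # lossless round trip
print(f"{len(chunk)} bytes in memory -> {len(stored)} bytes on disk")
```

Highly repetitive chunks like this one shrink dramatically; real-world ratios depend entirely on the data, which is why compressors expose tunable levels.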
All three costs scale linearly with data volume. In practice, BLAKE3 is fast enough that hashing rarely bottlenecks a modern pipeline, and modern compressors like zstd offer tunable speed-vs-ratio tradeoffs, but both represent real work on every byte that enters the system. Systems whose chunks have predictable internal structure can push further: Meta's OpenZL generates compressors tailored to a specific data format, achieving better compression ratios at higher speeds than general-purpose tools can manage.[22]
Chunk index lookups at scale
Memory is where the chunk index lives. The content-addressable store needs a searchable mapping from hash to storage location, and that index must be fast to query because every incoming chunk triggers a lookup. At scale, keeping a full chunk index in RAM becomes impractical, and a disk-based index with one seek per incoming chunk is far too slow.[16][18]
The primary cost driver is chunk count. The index size scales with the number of unique chunks, not with total data volume, which is good. But smaller average chunk sizes mean more chunks per file, which means a larger index. A system with 4 KB average chunks will produce roughly four times as many index entries as one with 16 KB chunks for the same data.
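A back-of-envelope calculation shows how quickly this adds up. The 48-byte entry size below is an assumption (a 32-byte hash plus roughly 16 bytes of location metadata); real entries vary by implementation, but the scaling with chunk count does not:

```python
# Assumed entry: 32-byte hash + ~16 bytes of location metadata (illustrative).
ENTRY_BYTES = 32 + 16

def index_bytes(total_bytes: int, avg_chunk_bytes: int) -> int:
    """Approximate index size if every chunk of the data were unique."""
    return (total_bytes // avg_chunk_bytes) * ENTRY_BYTES

TiB = 1 << 40
print(index_bytes(TiB, 4 * 1024) // (1 << 30), "GiB")   # 4 KiB chunks: 12 GiB
print(index_bytes(TiB, 16 * 1024) // (1 << 30), "GiB")  # 16 KiB chunks: 3 GiB
```

Under these assumptions, indexing 1 TiB at 4 KiB average chunks needs on the order of 12 GiB of index, four times the 3 GiB needed at 16 KiB chunks, which is exactly why the index outgrows RAM long before the data outgrows the disk.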
Once the index outgrows a single machine, or needs to be shared across a fleet, it becomes a distributed systems problem: you need a persistent, highly available data store (typically a database or distributed key-value system) to hold the mapping and serve lookups at low latency. That infrastructure has its own operational cost, and it scales with chunk count.
Transfer efficiency through deduplication
Network is often where deduplication pays for itself most visibly. In distributed systems (backup to a remote server, syncing across devices), only new chunks need to traverse the wire. The primary cost driver is the dedup ratio: the fraction of chunks that already exist at the destination and never need to be sent.
The Low-Bandwidth Network File System (LBFS) demonstrated this early on, achieving over an order of magnitude less bandwidth than traditional network file systems by transmitting only chunks not already present at the receiver.[19] If you edit a paragraph in a 10 MB document and the system produces 200 chunks, perhaps only 3 of those are new. That is a transfer of kilobytes instead of megabytes.
Smaller chunks generally improve this ratio because edits are less likely to span an entire small chunk, but each chunk also carries metadata overhead (its hash, its length, its position in the manifest), so there is a point of diminishing returns.
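That point of diminishing returns is easy to see with a hypothetical sync-cost model: suppose the sender ships a manifest of 32-byte fingerprints plus every chunk the receiver is missing, and a small edit dirties roughly one chunk. All the numbers below are illustrative, not from any real system:

```python
# Hypothetical model: bytes on the wire = full fingerprint manifest + one
# modified chunk. Illustrative numbers only.
HASH_BYTES = 32
FILE = 10 * 1024 * 1024        # a 10 MiB file with one small edit

for avg in (1024, 4096, 16384, 65536):
    n_chunks = FILE // avg
    manifest = n_chunks * HASH_BYTES   # metadata: grows as chunks shrink
    payload = avg                      # data: ~one modified chunk to resend
    print(f"{avg:6d} B chunks -> {manifest + payload:7d} bytes on the wire")
```

In this toy model the total is minimized at a mid-sized chunk: very small chunks drown the savings in manifest overhead, while very large chunks resend far more data than the edit touched.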
Unique chunks plus metadata overhead
Storage (disk or object store) holds the unique chunks plus all the metadata that lets you reconstruct files from them: hashes, chunk-to-file mappings, version manifests. The primary cost driver is the balance between dedup savings and metadata overhead. Smaller chunks improve deduplication (more sharing opportunities), but they also increase the metadata-to-data ratio.[21] On cloud object stores, chunk count also drives API operations costs: every PUT and GET is priced per request, so more chunks means more billable operations (Part 4 explores how containers address this).
At extremely small chunk sizes (say, 256 bytes), the overhead of storing a 32-byte hash and associated bookkeeping for each chunk becomes a significant fraction of the chunk itself.
Meyer and Bolosky found that for live desktop file systems, where most duplication consists of identical files stored in multiple locations, whole-file deduplication already captures roughly 75% of the savings of fine-grained block-level dedup.[20] But that result is workload-dependent. When files churn frequently and edits are localized within larger files (the pattern that dominates backup, sync, and software distribution), whole-file dedup sees zero savings on each modified file while CDC captures nearly everything. The value of sub-file chunking scales with both how much duplicated content exists and how frequently that content changes.
The explorer below visualizes these four dimensions as you move the chunk size slider. Small chunks push CPU, memory, and metadata overhead up while improving deduplication and network efficiency. Large chunks do the reverse. The sweet spot depends on your workload.
If you experimented with the Parametric Chunking Explorer in Part 2, you saw this tradeoff firsthand: smaller average sizes produced more chunks with tighter size distributions, while larger averages produced fewer, more variable chunks. Those demos showed the statistical effect. In production, the right balance depends on your workload: the volume of duplicated content, the rate at which that content changes, and how each cost dimension (CPU, memory, network, storage) maps onto your constraints. These are the competing forces that determine whether CDC is a valuable strategy for your system, and if so, what average chunk size best balances them. That answer depends on your domain, and only you as the expert in your particular system can make that call. My hope is that the intuitions developed here help you make a more informed decision.
Where CDC Lives Today
Content-defined chunking has become infrastructure, often invisible but always essential. It shows up across three broad categories: backup and archival, file sync and distribution, and content-addressable storage.
Backup and archival tools were the earliest adopters. Restic uses Rabin fingerprints with configurable chunk sizes,[29] Borg uses Buzhash with a secret seed (preventing attackers from predicting chunk boundaries based on known content),[30] and newer tools like Kopia,[31] Duplicacy,[32] Bupstash,[33] and Tarsnap[34] all rely on CDC to deduplicate across snapshots. The pattern is the same in each: split data into content-defined chunks, fingerprint each chunk, and store only the unique ones.
File sync and software distribution use CDC to minimize transfer sizes. Riot Games rebuilt the League of Legends patcher around FastCDC, replacing an older binary-delta system and achieving a tenfold improvement in patching speeds.[27] casync, created by Lennart Poettering, applies CDC to OS and container image distribution, chunking across file boundaries so that updates to a filesystem image only transfer the chunks that actually changed.[35]
Content-addressable storage systems like IPFS use CDC to split files into variable-size blocks before distributing them across a peer-to-peer network.[28] Because chunk boundaries are determined by content rather than position, identical regions of different files naturally converge on the same chunks and the same content addresses.
When CDC Is Not the Right Choice
Not every system chooses CDC, and the cost tradeoffs help explain why. CDC optimizes for one thing above all: stable chunk boundaries across edits. That stability enables fine-grained deduplication, but it comes at a cost, and not every application prioritizes deduplication over other concerns.
Dropbox is the most prominent example. Their architecture uses fixed-size 4 MiB blocks with SHA-256 hashing, and has since the early days of the product.[23] Dropbox’s primary engineering challenge was not deduplication but transport: syncing files across hundreds of millions of devices as fast as possible while keeping infrastructure costs predictable.
Fixed-size blocks give Dropbox properties that CDC cannot. Block N always starts at offset N * 4 MiB, so a client can request any block without first receiving a boundary list. Upload work can be split across threads by byte offset with zero coordination, because boundaries are known before the content is read. The receiver knows when each block ends, enabling Dropbox’s streaming sync architecture where downloads begin before the upload finishes, achieving up to 2x improvement on large file sync.[23] And because every block is exactly 4 MiB (except the last), memory allocation, I/O scheduling, and storage alignment are all simple to model and predict at scale.
There is also the metadata question. CDC’s chunk index must be backed by a persistent, highly available data store once it outgrows a single machine. For Dropbox, serving hundreds of millions of users, the difference between a fixed-size block index and a variable-size CDC chunk index is not just memory; it is the size and complexity of the metadata infrastructure required to support it. Fixed-size blocks produce fewer, more predictable index entries, which simplifies that infrastructure considerably.
The tradeoff is real. The QuickSync study found that a minor edit in Dropbox can generate sync traffic 10x the size of the actual modification, because insertions shift every subsequent block boundary.[25] This is precisely the boundary-shift problem that CDC was designed to solve, as we explored in Part 1. But Dropbox chose to absorb that cost and compensate elsewhere: their Broccoli compression encoder achieves ~33% upload bandwidth savings,[24] and the streaming sync architecture pipelines work so effectively that the extra bytes matter less than they otherwise would.
In short, Dropbox traded storage efficiency for transport speed and operational simplicity. Fixed-size blocks mean a predictable, easily modeled object count, which is critical when your storage bill depends on API call volume. The ability to parallelize everything without content-dependent coordination was worth more than the deduplication gains CDC would have provided.
Seafile, an open-source file sync platform, takes the opposite approach: it uses Rabin fingerprint-based CDC with ~1 MB average chunks to achieve block-level deduplication across file versions and libraries.[26] Where Dropbox chose to optimize purely for transport, Seafile shows that CDC-based sync systems can work in practice. Part 4 explores how the container abstraction makes this economically viable.
Why Cloud Storage Is the Cost That Matters
The four dimensions above assume a local or self-managed storage backend where the cost of writing and reading an object is just disk I/O. Most production systems today do not work that way. They store chunks on cloud object storage (S3, GCS, or Azure Blob Storage), and cloud providers turn every one of those engineering costs into a line item on a bill.
Cloud providers charge not just per GB stored but also per API operation. Every PUT and every GET has a price. When each chunk is its own object, the number of API calls scales with the number of chunks, and that operations cost can dominate the bill entirely. The same knob that the explorer above illustrates (smaller chunks improve dedup but increase chunk count) takes on a new, financially painful dimension: more chunks means more API calls means a larger cloud bill, even if the total bytes stored are fewer.
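A rough model makes the effect visible. The prices below are hypothetical (loosely S3-shaped, but pick any provider's published rates and the shape holds): per-request charges dwarf per-byte charges once chunks get small.

```python
# Hypothetical prices, illustrative only -- not any provider's actual rates.
PUT_COST = 0.005 / 1000        # dollars per PUT request
STORAGE_COST = 0.023           # dollars per GB-month

def first_upload_cost(total_gb: float, avg_chunk_kb: float) -> float:
    """Month-one cost of uploading total_gb with one object per chunk."""
    n_chunks = total_gb * 1024 * 1024 / avg_chunk_kb
    return n_chunks * PUT_COST + total_gb * STORAGE_COST

for kb in (4, 64, 1024):
    print(f"{kb:5d} KiB chunks: ${first_upload_cost(100, kb):7.2f}")
```

At 4 KiB chunks, the request charges for uploading 100 GB are roughly fifty times the storage charge; at 1 MiB chunks the storage charge dominates again. Deduplication savings on stored bytes can easily be wiped out by the operations column of the bill.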
This is the problem that Part 4: CDC in the Cloud tackles head-on. Grouping chunks into larger, fixed-size containers collapses the object count and makes CDC viable on cloud storage. But containers introduce their own challenges: fragmentation, garbage collection complexity, and restore performance degradation. Part 5 then takes a deep dive into the full cost picture, exploring how different storage providers, caching layers, and container configurations combine to determine the real monthly bill.
- Collision resistance requires that it is computationally infeasible to find two different inputs that produce the same hash. For this guarantee to hold, every bit of the input must influence the output. If the function skipped even a single byte, two inputs differing only in that byte would hash identically, a trivial collision. This is the fundamental difference from rolling hashes used for boundary detection: Gear hash only looks at a sliding window and is not collision-resistant, which is fine for finding chunk boundaries but not for content addressing, where a collision means two different chunks are treated as identical and one gets silently discarded. BLAKE3 is notably faster than SHA-256 here because it uses a Merkle tree structure internally, allowing parts of the input to be hashed in parallel across cores and SIMD lanes, but it still processes every byte. ↩
References
Tools & Implementations
The interactive animations in this post are available for experimentation. Try modifying the input text, adjusting chunk size parameters, and watching how CDC adapts to your changes.