From Problem to Taxonomy
Content-Defined Chunking, Part 1
February 02, 2026 | 27 minutes (5024 words)

Have you ever considered what it takes to store large amounts of user content at scale? Especially content like files or source code, where multiple versions of the same document exist with only minor changes between them. It's easy to take document storage for granted, but storing content at scale efficiently is a surprisingly nuanced problem. While there are numerous aspects to this problem space, this series focuses on one in particular: deduplication. How do you avoid storing the same unchanged bytes of a file over and over again as that file evolves? Ideally, a storage system would only store the minimum set of unique bytes. Is that even possible? This series will help answer that question, and more.
What is deduplication? At its heart, it is the separation of content that changed from content that has not, with varying levels of precision and accuracy. But how do you identify what changed? That's where the content-defined chunking (CDC) family of algorithms offers help. These algorithms share a common goal: splitting a file into smaller chunks at byte boundaries determined by the content's structure.
Ideally, a small edit to a file should minimize the impact on the surrounding chunks whose content has not changed, producing new chunks that capture the edit while preserving the existing ones. Diverse content brings unique structure, and not all content can be chunked the same way. As we will see, each CDC algorithm's approach affords its own benefits and tradeoffs.
This series opens with a survey of CDC algorithms, grounded in the 40 years of research behind them. Next, you'll take a deep dive into a popular and general-purpose algorithm, FastCDC, to understand how it works and what makes it so fast. To contextualize how CDC works in production systems, you'll then walk through a general deduplication pipeline. This sets the stage for understanding CDC's costs and tradeoffs, the factors that dominate deduplication system costs at scale, and how to navigate cloud storage decisions.
Along the way, several interactive visualizations help build mental models for the algorithms, their constraints, and the costs of deduplication systems. You can step through how FastCDC identifies chunk boundaries, see the impact of normalized chunking, and watch how chunk size affects storage pressure as you edit a file. As the discussion turns to costs, the visualizations let you manipulate container capacities and cache hit rates, all contextualized within the pricing schemes of popular cloud object storage and cache providers. This makes it easy to see how strongly levers like chunk size and container size affect the operating costs of deduplication systems at scale.
- Motivating the Problem
- Three Families of CDC
- A Closer Look at BSW via FastCDC (Part 2)
- Deduplication in Action (Part 3)
- CDC in the Cloud (Part 4)
- CDC at Scale on a Budget (Part 5)
- References
Visualizations
- Fixed-Size Chunking
- Content-Defined Chunking
- Gear Hash in Action
- Basic vs Normalized Chunk Size Distribution
- Parametric Chunking Explorer
- Deduplication Explorer
- Cost Tradeoffs Explorer
- Established Object Storage Provider Cost Explorer
- Established Object Storage Provider Cost Explorer with Containers
- Challenger Object Storage Provider Cost Explorer
- Established vs. Challenger Object Storage Provider Cost Comparison
- Zipf Popularity Distribution
- Cache Size vs. Hit Rate
- Established Cache Provider Cost Explorer
- Challenger Cache Provider Cost Explorer
- Comprehensive Cost Model
Motivating the Problem
Imagine you're building a backup system. A user stores a 500MB file, then modifies a single paragraph and saves it again. In a naive system, this results in two nearly identical copies of the same file. Despite the small change of a single paragraph, the storage system grew from 500MB to 1GB. Surely we can do better.
This is the deduplication problem, and it shows up in many familiar places: cloud blob storage providers managing petabytes of user files (e.g. Amazon S3 or Azure Blob Storage), cloud file servers like Google Drive or iCloud, and software backup tools like Restic and Borg.
The simplest form of deduplication is whole-file comparison: hash the entire file, and if two files produce the same hash, store only one copy. This works well for exact duplicates, but falls apart with even the smallest edit. Change a single byte and the hash changes completely, so the system treats the original and the edited version as two entirely different files.
One fix is to reduce the granularity of comparison. Instead of hashing a file as a single unit, split it into smaller segments called chunks and hash each chunk independently. A small edit now only affects the chunks near the change, leaving the rest unchanged. Those unchanged chunks can be recognized as duplicates and stored only once. The question then becomes: how should we decide where to split?
Why Not Just Use Fixed-Size Chunks?
The naive approach to chunking is fixed-size splitting: choose a chunk size, say 4KB, and split the file at every 4KB boundary. A 1MB file becomes 256 chunks of 4KB each. This approach is conceptually simple, but is problematic if we want to prevent change amplification: the invalidation of unchanged chunks when small edits occur. Using this naive chunking strategy, let's see what happens to unchanged chunks when a small edit occurs at the beginning of a file:
Inserting "NEW INTRO." at the beginning of the file causes every chunk boundary to shift, invalidating all five original chunks. The result is five new chunks and zero unchanged chunks, producing a deduplication ratio of 0%. In practice, this means the entire file would need to be stored again, even though most of its content did not change. We need a chunking strategy whose boundaries are not determined by fixed byte offsets, and that offers more flexibility to identify split points that better preserve unchanged chunks.
The Core Idea: Content as the Arbiter
Instead of using fixed-length byte windows to split a file into chunks, what if we could use patterns or structure in the fileβs content to identify chunk boundaries? This is the core problem a family of algorithms known as content-defined chunking (CDC) attempts to solve.
How does CDC decide where to split? The details vary across algorithms, but the core principle is the same: examine a small region of data at each position, and declare a boundary when the content at that position satisfies some condition. Different algorithms use different strategies for this. Some compute a hash of a sliding window, some look for local extrema in the byte values, and some use statistical properties of the data. What they all share is that the boundary decision, or split point, is dependent on the content itself.
Let's revisit the same example from before, but this time we split the text at sentence boundaries. Each sentence ending (a period, exclamation mark, or question mark followed by a space) defines a chunk boundary. Because the boundary is determined by the content itself, not by a fixed byte count, inserting text at the beginning of the file does not invalidate existing unchanged chunks.
Inserting "NEW INTRO." creates just one new chunk. The original five sentences are unchanged, so their chunks are identical to before. The result is a much higher deduplication ratio, meaning we only need to store the new chunk and can reference the existing chunks for the rest of the file.
Three Families of CDC
Origins
The story begins with Turing Award winner Michael Rabin, who introduced polynomial fingerprinting in 1981.[1] His key insight: represent a sequence of bytes as a polynomial over $GF(2)$ and reduce it modulo a randomly chosen irreducible polynomial to get a "fingerprint" that uniquely identifies the content with high probability. More importantly, this fingerprint could be computed incrementally as a rolling hash, making it efficient to slide across data.
For a sequence of bytes $b_0, b_1, \ldots, b_{n-1}$, the fingerprint is:
\[f(x) = b_0 + b_1 \cdot x + b_2 \cdot x^2 + \ldots + b_{n-1} \cdot x^{n-1} \mod p(x)\]

where $p(x)$ is an irreducible polynomial over $GF(2)$.
Twenty years later, the Low-Bandwidth File System (LBFS) at MIT became the first major system to use CDC in practice.[2] LBFS slid a 48-byte window across the data and computed a Rabin fingerprint at each position. Whenever the low 13 bits of the fingerprint equaled a magic constant, it declared a chunk boundary, producing an average chunk size of about 8KB. The breakthrough was demonstrating that CDC could achieve dramatic bandwidth savings for real file workloads. Modifying a single paragraph in a large document transmitted only the changed chunk, not the entire file.
// Simplified LBFS boundary check
if ((fingerprint % 8192) == 0x78) {
    // This is a chunk boundary
    emit_chunk(start, current_position);
    start = current_position;
}
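The 8KB average follows directly from this boundary test. Assuming fingerprint values are uniformly distributed, each byte position satisfies the 13-bit condition independently with probability $2^{-13}$, so the distance between boundaries is geometrically distributed with mean

\[E[\text{chunk size}] = \frac{1}{\Pr[\text{boundary}]} = 2^{13} = 8192 \text{ bytes}\]

The same reasoning explains why basic sliding-window chunk sizes follow an exponential distribution: very long gaps between boundaries are unlikely but possible, which is why practical systems impose minimum and maximum chunk sizes.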
The deduplication era of 2005-2015 drove an explosion of CDC research. Many successful systems built deduplication pipelines using CDC based on advances in research that produced faster hash functions, better chunk size distributions, and new ways of finding chunk boundaries. By the mid-2010s, what had been a single technique branched into a family of algorithms with fundamentally different strategies.
A Taxonomy of CDC Algorithms
A comprehensive 2024 survey by Gregoriadis et al.[12] organizes the CDC landscape into three distinct families based on their core mechanism for finding chunk boundaries. This taxonomy clarifies a field that can otherwise feel like a confusing proliferation of acronyms.
| | BSW (Basic Sliding Window) | Local Extrema | Statistical |
|---|---|---|---|
| Algorithms | Rabin 1981, Buzhash 1997, Gear 2014, FastCDC 2016, PCI 2020 | AE 2015, RAM 2017, MII 2019, VectorCDC 2025 | BFBC 2020 |
| Core operation | Rolling hash + mask | Byte comparisons | Frequency analysis |
| Throughput | Medium–High | High | Medium |
| Dedup ratio | High | Comparable | Dataset-dependent |
| SIMD-friendly | Limited | Excellent | Limited |
| Streaming | Yes | Yes | No (pre-scan) |
| Chunk distribution | Exponential (improved with NC) | Varies | Varies |
| Used in practice | Restic, Borg, FastCDC | Research | Research |
Algorithmic Timeline
// Rabin fingerprint: rolling hash over GF(2) (simplified sketch; real
// implementations reduce via precomputed tables rather than a single XOR)
uint64_t fp = 0;
for (size_t i = 0; i < len; i++) {
    fp ^= shift_table[window[i % w]];  // remove outgoing byte's contribution
    fp = (fp << 8) | data[i];          // shift in the new byte
    if (fp & HIGH_BIT) fp ^= poly;     // (partial) reduction mod irreducible poly
    window[i % w] = data[i];
    if ((fp % D) == r) return i;       // boundary!
}
// Buzhash: cyclic polynomial rolling hash
uint32_t table[256];  // random values, initialized once
uint32_t h = 0;
for (size_t i = 0; i < len; i++) {
    h = ROTATE_LEFT(h, 1);                      // cyclic shift by 1
    h ^= ROTATE_LEFT(table[window[i % w]], w);  // remove outgoing byte
    h ^= table[data[i]];                        // add incoming byte
    window[i % w] = data[i];
    if ((h % D) == r) return i;                 // boundary!
}
Gear (2014) stripped the rolling hash down to a single update: hash = (hash << 1) + table[byte]. Since older bits naturally shift out of a 64-bit register, the hash is dominated by the most recent ~64 bytes. The insight is that for CDC purposes, you don't need a true sliding window hash; an approximate one where old bytes decay away is sufficient, since boundary decisions are local. One shift plus one add gives the tightest inner loop of any CDC hash, roughly 2-3× faster than Buzhash.
// Gear hash: feedforward, no explicit window needed
uint64_t gear_table[256];  // random 64-bit values
uint64_t hash = 0;
for (size_t i = min_size; i < len; i++) {
    hash = (hash << 1) + gear_table[data[i]];
    if ((hash & mask) == 0)
        return i;  // boundary!
}
// AE: boundary when the max byte sits at the window's right edge
for (size_t i = min_size; i < len; i++) {
    uint8_t max_val = 0;
    size_t max_pos = 0;
    for (size_t j = i - w + 1; j <= i; j++) {
        if (data[j] >= max_val) {
            max_val = data[j];
            max_pos = j;
        }
    }
    if (max_pos == i) return i;  // boundary!
}
By pairing the Gear hash with normalized chunking and skipping the region where no boundary is allowed (the first min_size bytes), FastCDC reported 10× throughput over Rabin-based CDC while matching or exceeding its deduplication ratio.
// FastCDC: Gear hash + normalized chunking
uint64_t hash = 0;
size_t i = min;
for (; i < avg && i < len; i++) {  // phase 1: strict mask (more bits)
    hash = (hash << 1) + gear_table[data[i]];
    if (!(hash & mask_s)) return i;
}
for (; i < max && i < len; i++) {  // phase 2: loose mask (fewer bits)
    hash = (hash << 1) + gear_table[data[i]];
    if (!(hash & mask_l)) return i;
}
return i;  // hit max chunk size
// RAM: skip to the max, don't scan past it
size_t i = min;
while (i < len) {
    size_t max_pos = i;
    for (size_t j = i + 1; j <= i + ahead; j++)  // scan the lookahead region
        if (data[j] >= data[max_pos]) max_pos = j;
    if (max_pos != i) { i = max_pos; continue; } // skip ahead!
    bool ok = true;
    for (size_t j = i - back; j < i; j++)        // check the lookback region
        if (data[j] > data[i]) { ok = false; break; }
    if (ok) return i;  // boundary!
    i++;
}
// MII: boundary candidates are independent of chunk size
size_t last = 0;
for (size_t i = 0; i < len; i++) {
    bool is_max = true;
    for (int j = -W/2; j <= W/2; j++)  // large symmetric window around i
        if (data[i + j] > data[i]) { is_max = false; break; }
    if (!is_max) continue;
    if (i - last >= min) {  // filter candidates by minimum chunk size
        last = i;
        emit_boundary(i);   // boundary!
    }
}
PCI (2020) drops hashing altogether and instead counts the set bits in the window, which maps directly onto hardware POPCNT instructions that count bits in a single cycle. No hash tables, no polynomial arithmetic, no random lookup tables. It is just counting bits in the raw data. The sliding window update is also simple: add the incoming byte's popcount, subtract the outgoing byte's.
// PCI: boundary when the window's bit population exceeds a threshold
int pop_sum = 0;
for (size_t i = 0; i < len; i++) {
    pop_sum += __builtin_popcount(data[i]);         // add incoming byte
    if (i >= w)
        pop_sum -= __builtin_popcount(data[i - w]); // remove outgoing byte
    if (i >= min && pop_sum >= threshold)
        return i;  // boundary!
}
// BFBC Phase 1: build digram frequency table
uint32_t freq[256][256] = {0};
for (size_t i = 0; i + 1 < len; i++)
    freq[data[i]][data[i + 1]]++;
bool is_cut[256][256];
select_top_k(freq, k, is_cut);  // (pseudocode) mark the k most frequent digrams

// Phase 2: chunk using the frequent digrams as boundaries
for (size_t i = min; i + 1 < len; i++) {
    if (is_cut[data[i]][data[i + 1]])
        return i + 1;  // boundary after the digram
}
VectorCDC (2025) observes that local-extrema comparisons vectorize naturally, using SIMD instructions like _mm256_max_epu8 to compare all bytes simultaneously, extracting boundary candidates via movemask. Hash-based algorithms resist this because each hash update depends sequentially on the previous one. The VRAM variant (vectorized RAM) achieves 16-42× throughput over scalar implementations, approaching memory bandwidth limits (~10-15 GB/s). Deduplication ratios remain identical since the boundary decisions are mathematically equivalent.
// VectorCDC: SIMD-accelerated local-max search (AVX2)
for (; i + 32 <= len; i += 32) {
    __m256i curr = _mm256_loadu_si256((const __m256i *)(data + i));
    __m256i prev = _mm256_loadu_si256((const __m256i *)(data + i - 1));
    __m256i next = _mm256_loadu_si256((const __m256i *)(data + i + 1));
    __m256i is_max = _mm256_and_si256(
        _mm256_cmpeq_epi8(curr, _mm256_max_epu8(curr, prev)),
        _mm256_cmpeq_epi8(curr, _mm256_max_epu8(curr, next)));
    uint32_t mask = _mm256_movemask_epi8(is_max);
    if (mask) return i + __builtin_ctz(mask);  // first local max in the block
}
In the next post, Part 2: A Deep Dive into FastCDC, we'll take a closer look at the BSW family through FastCDC, an algorithm that combines Gear hashing with normalized chunking to achieve both high throughput and excellent deduplication.