From Problem to Taxonomy
Content-Defined Chunking, Part 1
February 02, 2026 | 27 minutes (5024 words)

Have you ever considered what it takes to store large amounts of user content at scale? Especially content like files or source code, where multiple versions of the same document exist with only minor changes between them. It's easy to take document storage for granted, but storing content at scale efficiently is a surprisingly nuanced problem. While there are numerous aspects to this problem space, this series focuses on one in particular: deduplication. How do you avoid storing the same unchanged bytes of a file over and over again as that file evolves? Ideally, a storage system would only store the minimum set of unique bytes. Is that even possible? This series will help answer that question, and more.
What is deduplication? At its heart, it is the separation of content that changed from content that has not, with varying levels of precision and accuracy. But how do you identify what changed? That's where the content-defined chunking (CDC) family of algorithms offers help. These algorithms share a common goal: splitting a file into smaller chunks at byte boundaries determined by the content's structure.
Ideally, a small edit to a file should minimize the impact on the surrounding chunks whose content has not changed, producing new chunks that capture the edit while preserving the existing ones. Diverse content brings unique structure, and not all content can be chunked the same way. As we will see, each CDC algorithm's approach affords its own benefits and tradeoffs.
This series opens with a survey of CDC algorithms, grounded in the 40 years of research behind them. Next, you'll take a deep dive into a popular and general-purpose algorithm, FastCDC, to understand how it works and what makes it so fast. To contextualize how CDC works in production systems, you'll then walk through a general deduplication pipeline. This sets the stage for understanding CDC's costs and tradeoffs, the factors that dominate deduplication system costs at scale, and how to navigate cloud storage decisions.
Along the way, several interactive visualizations help build mental models for the algorithms, their constraints, and the costs of deduplication systems. You can step through how FastCDC identifies chunk boundaries, see the impact of normalized chunking, and watch how chunk size affects storage pressure as you edit a file. As the discussion turns to costs, the visualizations let you manipulate container capacities and cache hit rates, all contextualized within the pricing schemes of popular cloud object storage and cache providers. This makes it easy to see how strongly levers like chunk size and container size affect the operating costs of deduplication systems at scale.
- Motivating the Problem
- Three Families of CDC
- A Closer Look at BSW via FastCDC (Part 2)
- Deduplication in Action (Part 3)
- CDC in the Cloud (Part 4)
- CDC at Scale on a Budget (Part 5)
- References
Visualizations
- Fixed-Size Chunking
- Content-Defined Chunking
- Gear Hash in Action
- Basic vs Normalized Chunk Size Distribution
- Parametric Chunking Explorer
- Deduplication Explorer
- Cost Tradeoffs Explorer
- Established Object Storage Provider Cost Explorer
- Established Object Storage Provider Cost Explorer with Containers
- Challenger Object Storage Provider Cost Explorer
- Established vs. Challenger Object Storage Provider Cost Comparison
- Zipf Popularity Distribution
- Cache Size vs. Hit Rate
- Established Cache Provider Cost Explorer
- Challenger Cache Provider Cost Explorer
- Comprehensive Cost Model
Motivating the Problem
Imagine you're building a backup system. A user stores a 500MB file, then modifies a single paragraph and saves it again. In a naive system, this results in two nearly identical copies of the same file. Despite the small change of a single paragraph, the storage system grew from 500MB to 1GB. Surely we can do better.
This is the deduplication problem, and it shows up in many familiar places: cloud blob storage providers managing petabytes of user files (e.g. Amazon S3 or Azure Blob Storage), cloud file servers like Google Drive or iCloud, and software backup tools like Restic and Borg.
The simplest form of deduplication is whole-file comparison: hash the entire file, and if two files produce the same hash, store only one copy. This works well for exact duplicates, but falls apart with even the smallest edit. Change a single byte and the hash changes completely, so the system treats the original and the edited version as two entirely different files.
One fix is to reduce the granularity of comparison. Instead of hashing a file as a single unit, split it into smaller segments called chunks and hash each chunk independently. A small edit now only affects the chunks near the change, leaving the rest unchanged. Those unchanged chunks can be recognized as duplicates and stored only once. The question then becomes: how should we decide where to split?
Why Not Just Use Fixed-Size Chunks?
The naive approach to chunking is fixed-size splitting: choose a chunk size, say 4KB, and split the file at every 4KB boundary. A 1MB file becomes 256 chunks of 4KB each. This approach is conceptually simple, but is problematic if we want to prevent change amplification: the invalidation of unchanged chunks when small edits occur. Using this naive chunking strategy, let's see what happens to unchanged chunks when a small edit occurs at the beginning of a file:
Inserting "NEW INTRO." at the beginning of the file causes every chunk boundary to shift, invalidating all five original chunks. The result is five new chunks and zero unchanged chunks, producing a deduplication ratio of 0%. In practice, this means the entire file would need to be stored again, even though most of its content did not change. We need a chunking strategy whose boundaries are not determined by fixed byte offsets, and that offers more flexibility to identify split points that better preserve unchanged chunks.
The Core Idea: Content as the Arbiter
Instead of using fixed-length byte windows to split a file into chunks, what if we could use patterns or structure in the fileβs content to identify chunk boundaries? This is the core problem a family of algorithms known as content-defined chunking (CDC) attempts to solve.
How does CDC decide where to split? The details vary across algorithms, but the core principle is the same: examine a small region of data at each position, and declare a boundary when the content at that position satisfies some condition. Different algorithms use different strategies for this. Some compute a hash of a sliding window, some look for local extrema in the byte values, and some use statistical properties of the data. What they all share is that the boundary decision, or split point, is dependent on the content itself.
Let's revisit the same example from before, but this time we split the text at sentence boundaries. Each sentence ending (a period, exclamation mark, or question mark followed by a space) defines a chunk boundary. Because the boundary is determined by the content itself, not by a fixed byte count, inserting text at the beginning of the file does not invalidate existing unchanged chunks.
Inserting "NEW INTRO." creates just one new chunk. The original five sentences are unchanged, so their chunks are identical to before. The result is a much higher deduplication ratio, meaning we only need to store the new chunk and can reference the existing chunks for the rest of the file.
Three Families of CDC
Origins
The story begins with Turing Award winner Michael Rabin, who introduced polynomial fingerprinting in 1981.[1] His key insight: represent a sequence of bytes as a polynomial over $GF(2)$ and reduce it modulo a randomly chosen irreducible polynomial to get a "fingerprint" that uniquely identifies the content with high probability. More importantly, this fingerprint could be computed incrementally as a rolling hash, making it efficient to slide across data.
For a sequence of bytes $b_0, b_1, \ldots, b_{n-1}$, the fingerprint is:
\[f(x) = b_0 + b_1 \cdot x + b_2 \cdot x^2 + \ldots + b_{n-1} \cdot x^{n-1} \mod p(x)\]

where $p(x)$ is an irreducible polynomial over $GF(2)$.
Twenty years later, the Low-Bandwidth File System (LBFS) at MIT became the first major system to use CDC in practice.[2] LBFS slid a 48-byte window across the data and computed a Rabin fingerprint at each position. Whenever the low 13 bits of the fingerprint equaled a magic constant, it declared a chunk boundary, producing an average chunk size of about 8KB. The breakthrough was demonstrating that CDC could achieve dramatic bandwidth savings for real file workloads. Modifying a single paragraph in a large document transmitted only the changed chunk, not the entire file.
// Simplified LBFS boundary check
if ((fingerprint % 8192) == 0x78) {
    // This is a chunk boundary
    emit_chunk(start, current_position);
    start = current_position;
}
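The 8KB average follows directly from this boundary test. Assuming fingerprint values are uniformly distributed, each byte position satisfies the 13-bit condition independently with probability $2^{-13}$, so the distance between boundaries is geometrically distributed with mean

\[E[\text{chunk size}] = \frac{1}{\Pr[\text{boundary}]} = 2^{13} = 8192 \text{ bytes}\]

The same reasoning explains why basic sliding-window chunk sizes follow an exponential distribution: very long gaps between boundaries are unlikely but possible, which is why practical systems impose minimum and maximum chunk sizes.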
The deduplication era of 2005-2015 drove an explosion of CDC research. Many successful systems built deduplication pipelines using CDC based on advances in research that produced faster hash functions, better chunk size distributions, and new ways of finding chunk boundaries. By the mid-2010s, what had been a single technique branched into a family of algorithms with fundamentally different strategies.
A Taxonomy of CDC Algorithms
A comprehensive 2024 survey by Gregoriadis et al.[12] organizes the CDC landscape into three distinct families based on their core mechanism for finding chunk boundaries. This taxonomy clarifies a field that can otherwise feel like a confusing proliferation of acronyms.
| | BSW (Basic Sliding Window) | Local Extrema | Statistical |
|---|---|---|---|
| Algorithms | Rabin 1981, Buzhash 1997, Gear 2014, FastCDC 2016, PCI 2020 | AE 2015, RAM 2017, MII 2019, VectorCDC 2025 | BFBC 2020 |
| Core operation | Rolling hash + mask | Byte comparisons | Frequency analysis |
| Throughput | Medium–High | High | Medium |
| Dedup ratio | High | Comparable | Dataset-dependent |
| SIMD-friendly | Limited | Excellent | Limited |
| Streaming | Yes | Yes | No (pre-scan) |
| Chunk distribution | Exponential (improved with NC) | Varies | Varies |
| Used in practice | Restic, Borg, FastCDC | Research | Research |
Algorithmic Timeline
// Rabin fingerprint: rolling hash over GF(2) (simplified sketch; real
// implementations reduce via precomputed tables rather than a single XOR)
uint64_t fp = 0;
for (size_t i = 0; i < len; i++) {
    fp ^= shift_table[window[i % w]];  // remove outgoing byte's contribution
    fp = (fp << 8) | data[i];          // shift in the new byte
    if (fp & HIGH_BIT) fp ^= poly;     // (partial) reduction mod irreducible poly
    window[i % w] = data[i];
    if ((fp % D) == r) return i;       // boundary!
}
// Buzhash: cyclic polynomial rolling hash
uint32_t table[256];  // random values, initialized once
uint32_t h = 0;
for (size_t i = 0; i < len; i++) {
    h = ROTATE_LEFT(h, 1);                      // cyclic shift by 1
    h ^= ROTATE_LEFT(table[window[i % w]], w);  // remove outgoing byte
    h ^= table[data[i]];                        // add incoming byte
    window[i % w] = data[i];
    if ((h % D) == r) return i;                 // boundary!
}
Gear (2014) stripped the rolling hash down to a single update: hash = (hash << 1) + table[byte]. Since older bits naturally shift out of a 64-bit register, the hash is dominated by the most recent ~64 bytes. The insight is that for CDC purposes, you don't need a true sliding window hash; an approximate one where old bytes decay away is sufficient, since boundary decisions are local. One shift plus one add gives the tightest inner loop of any CDC hash, roughly 2-3× faster than Buzhash.
// Gear hash: feedforward, no explicit window needed
uint64_t gear_table[256];  // random 64-bit values
uint64_t hash = 0;
for (size_t i = min_size; i < len; i++) {
    hash = (hash << 1) + gear_table[data[i]];
    if ((hash & mask) == 0)
        return i;  // boundary!
}
// AE: boundary when the max byte sits at the window's right edge
for (size_t i = min_size; i < len; i++) {
    uint8_t max_val = 0;
    size_t max_pos = 0;
    for (size_t j = i - w + 1; j <= i; j++) {
        if (data[j] >= max_val) {
            max_val = data[j];
            max_pos = j;
        }
    }
    if (max_pos == i) return i;  // boundary!
}
By pairing the Gear hash with normalized chunking and skipping the region where no boundary is allowed (the first min_size bytes), FastCDC reported 10× throughput over Rabin-based CDC while matching or exceeding its deduplication ratio.
// FastCDC: Gear hash + normalized chunking
uint64_t hash = 0;
size_t i = min;
for (; i < avg && i < len; i++) {  // phase 1: strict mask (more bits)
    hash = (hash << 1) + gear_table[data[i]];
    if (!(hash & mask_s)) return i;
}
for (; i < max && i < len; i++) {  // phase 2: loose mask (fewer bits)
    hash = (hash << 1) + gear_table[data[i]];
    if (!(hash & mask_l)) return i;
}
return i;  // hit max chunk size
// RAM: skip to the max, don't scan past it
size_t i = min;
while (i < len) {
    size_t max_pos = i;
    for (size_t j = i + 1; j <= i + ahead; j++)  // scan the lookahead region
        if (data[j] >= data[max_pos]) max_pos = j;
    if (max_pos != i) { i = max_pos; continue; } // skip ahead!
    bool ok = true;
    for (size_t j = i - back; j < i; j++)        // check the lookback region
        if (data[j] > data[i]) { ok = false; break; }
    if (ok) return i;  // boundary!
    i++;
}
// MII: boundary candidates are independent of chunk size
size_t last = 0;
for (size_t i = 0; i < len; i++) {
    bool is_max = true;
    for (int j = -W/2; j <= W/2; j++)  // large symmetric window around i
        if (data[i + j] > data[i]) { is_max = false; break; }
    if (!is_max) continue;
    if (i - last >= min) {  // filter candidates by minimum chunk size
        last = i;
        emit_boundary(i);   // boundary!
    }
}
PCI (2020) drops hashing altogether and instead counts the set bits in the window, which maps directly onto hardware POPCNT instructions that count bits in a single cycle. No hash tables, no polynomial arithmetic, no random lookup tables. It is just counting bits in the raw data. The sliding window update is also simple: add the incoming byte's popcount, subtract the outgoing byte's.
// PCI: boundary when the window's bit population exceeds a threshold
int pop_sum = 0;
for (size_t i = 0; i < len; i++) {
    pop_sum += __builtin_popcount(data[i]);         // add incoming byte
    if (i >= w)
        pop_sum -= __builtin_popcount(data[i - w]); // remove outgoing byte
    if (i >= min && pop_sum >= threshold)
        return i;  // boundary!
}
// BFBC Phase 1: build digram frequency table
uint32_t freq[256][256] = {0};
for (size_t i = 0; i + 1 < len; i++)
    freq[data[i]][data[i + 1]]++;
bool is_cut[256][256];
select_top_k(freq, k, is_cut);  // (pseudocode) mark the k most frequent digrams

// Phase 2: chunk using the frequent digrams as boundaries
for (size_t i = min; i + 1 < len; i++) {
    if (is_cut[data[i]][data[i + 1]])
        return i + 1;  // boundary after the digram
}
VectorCDC (2025) observes that local-extrema comparisons vectorize naturally, using SIMD instructions like _mm256_max_epu8 to compare all bytes simultaneously, extracting boundary candidates via movemask. Hash-based algorithms resist this because each hash update depends sequentially on the previous one. The VRAM variant (vectorized RAM) achieves 16-42× throughput over scalar implementations, approaching memory bandwidth limits (~10-15 GB/s). Deduplication ratios remain identical since the boundary decisions are mathematically equivalent.
// VectorCDC: SIMD-accelerated local-max search (AVX2)
for (; i + 32 <= len; i += 32) {
    __m256i curr = _mm256_loadu_si256((const __m256i *)(data + i));
    __m256i prev = _mm256_loadu_si256((const __m256i *)(data + i - 1));
    __m256i next = _mm256_loadu_si256((const __m256i *)(data + i + 1));
    __m256i is_max = _mm256_and_si256(
        _mm256_cmpeq_epi8(curr, _mm256_max_epu8(curr, prev)),
        _mm256_cmpeq_epi8(curr, _mm256_max_epu8(curr, next)));
    uint32_t mask = _mm256_movemask_epi8(is_max);
    if (mask) return i + __builtin_ctz(mask);  // first local max in the block
}
In the next post, Part 2: A Deep Dive into FastCDC, we'll take a closer look at the BSW family through FastCDC, an algorithm that combines Gear hashing with normalized chunking to achieve both high throughput and excellent deduplication.