A Deep Dive into FastCDC
Content-Defined Chunking, Part 2
February 09, 2026 | 20 minutes (3677 words)

Part 1 introduced the deduplication problem, showed why fixed-size chunking fails, and surveyed three CDC algorithm families: Basic Sliding Window (BSW), Local Extrema, and Statistical. This post focuses on FastCDC, the most widely adopted BSW algorithm, exploring its Gear hash, normalized chunking strategy, and tunable parameters through interactive demos.
A Closer Look at BSW via FastCDC
The Basic Sliding Window family includes FastCDC, which is true to its name: it is fast. Where Rabin used polynomial arithmetic and Buzhash used cyclic shifts, FastCDC’s Gear hash strips the rolling hash down to its simplest possible form. That speed, combined with normalized chunking for tighter chunk-size distributions, has made FastCDC one of the most widely implemented CDC algorithms today, with mature libraries in Rust, Go, Python, Java, and C++. We’ll explore both the 2016[5] and 2020[6] versions in detail, using the fastcdc-rs Rust crate as our reference implementation.
The Gear Hash
At FastCDC’s core is the Gear hash, a rolling hash reduced to two operations. For each byte, you:
- Left-shift the current hash by one bit, dropping the most significant bit
- Add the value from the GEAR table keyed by the current byte, a pre-computed 64-bit random value, to the hash
That’s it. No XOR with outgoing bytes, no polynomial division. Just shift and add.
The visualization below uses 32-bit values for compactness. The real FastCDC implementation uses 64-bit GEAR table entries and a 64-bit hash accumulator, as shown in the code samples that follow. The algorithm works identically at either width – only the bit count and mask positions change. To internalize the algorithm, try advancing the animation one step at a time and observe the Gear table lookup, the rolling hash manipulation, and the binary addition at each position. Playing the animation at full speed is also useful for seeing the overall flow, but stepping through it frame by frame is the best way to build intuition.
Each colored block is one of 256 pre-computed random 32-bit values, keyed by byte. Hover a cell to see its mapping.
The hash rolls forward one byte at a time. When it matches a bit pattern, a chunk boundary is placed. Target chunk size: min 8, avg 16, max 32 bytes.
The GEAR table maps each of the 256 possible byte values to a pre-computed 64-bit random number:
// GEAR table: 256 random 64-bit values
GEAR[256] = generate_random_table()
function gear_hash(data, start, end):
hash = 0
for i from start to end:
hash = (hash << 1) + GEAR[data[i]]
return hash
// From fastcdc-rs v2020
const GEAR: [u64; 256] = [
0x3b5d3c7d207e37dc, 0x784d68ba91123086,
0xcd52880f882e7298, 0xecc4917415d5c696,
// ... 252 more values
];
fn gear_hash(data: &[u8]) -> u64 {
let mut hash: u64 = 0;
for &byte in data {
hash = hash.wrapping_shl(1)
.wrapping_add(GEAR[byte as usize]);
}
hash
}
// GEAR table (first few values shown)
const GEAR: bigint[] = [
0x3b5d3c7d207e37dcn, 0x784d68ba91123086n,
0xcd52880f882e7298n, 0xecc4917415d5c696n,
// ... 252 more values
];
function gearHash(data: Uint8Array): bigint {
let hash = 0n;
for (const byte of data) {
hash = ((hash << 1n) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFFn;
}
return hash;
}
The simplicity of this hash is the point. A single left-shift and a single addition per byte gives the Gear hash its speed advantage over Rabin (which requires polynomial division) and Buzhash (which requires XORs with both the incoming and outgoing bytes). But a fast hash is only half the story. The other half is deciding when the hash value signals a chunk boundary.
Finding Chunk Boundaries
With the Gear hash updating for each byte, how do we decide where to cut?
The classic approach: check if the low N bits of the hash are zero. If we want an average chunk size of 8KB, we check if hash & 0x1FFF == 0 (the low 13 bits).
Why does this work? The Gear hash produces pseudo-random values, so the probability that any fixed set of N bits is all zero is $1/2^N$. For 13 bits, that’s $1/2^{13} = 1/8192$, meaning that on average one in every 8,192 bytes triggers a boundary. The mask is the chunk size control: more bits mean larger average chunks, fewer bits mean smaller ones.
This is the heart of every BSW algorithm. The algorithm doesn’t search for patterns in the content directly. Instead, it feeds each byte through the GEAR table, lets the rolling hash mix the values together, and checks whether certain bits of the result happen to be zero. The content determines the hash, the mask selects which bits to check, and a boundary is placed wherever those bits are all zero.
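As a concrete sketch of that check (the `mask` and `isBoundary` helpers here are illustrative, not fastcdc-rs code), the boundary test reduces to one AND and one comparison:

```typescript
// Build a mask with the low N bits set; a pseudo-random hash
// clears all N bits with probability 1/2^N
const mask = (bits: number): bigint => (1n << BigInt(bits)) - 1n;

// 13 low bits for a target average chunk size of 8 KB
const avgMask = mask(13); // 0x1FFFn

// The boundary check: cut here if the masked bits are all zero
const isBoundary = (hash: bigint): boolean => (hash & avgMask) === 0n;
```

On average one hash value in every 8,192 passes the check, which is exactly where the 8 KB target comes from.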
But basic masking has a problem, and FastCDC does something more clever: normalized chunking with dual masks.
The problem with basic masking is that chunk sizes follow an exponential distribution. You get many small chunks and occasional very large ones. This hurts deduplication because small chunks increase metadata overhead, and large chunks reduce sharing opportunities.
Why exponential? Because the probability of hitting a boundary is the same at every byte position. Each byte has a $1/2^N$ chance of triggering a cut, regardless of how many bytes have already been consumed into the current chunk. Short chunks are always more likely than long ones, for the same reason that flipping heads on the first try is more likely than waiting until the tenth.
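The skew is easy to reproduce with a small Monte Carlo sketch (the fixed per-byte probability and the seeded `mulberry32` generator are modeling assumptions for reproducibility, not part of any CDC implementation):

```typescript
// Tiny seeded PRNG so the run is repeatable
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Single-mask model: every byte has the same 1/16 chance of being
// a boundary, like a 4-bit mask with a 16-byte target average
const rand = mulberry32(42);
const lengths: number[] = [];
for (let i = 0; i < 100_000; i++) {
  let len = 1;
  while (rand() >= 1 / 16) len++; // no boundary yet, keep consuming
  lengths.push(len);
}

const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
const belowMean = lengths.filter((l) => l < mean).length / lengths.length;
// The mean lands near the 16-byte target, yet well over half of all
// chunks come out shorter than it: the classic geometric skew
```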
The fix is intuitive: vary the probability based on how many bytes have been consumed into the current chunk so far. If only a few bytes have been consumed, make boundaries harder to find so you don’t produce tiny chunks. Once you’ve consumed past the target average, make boundaries easier to find so chunks don’t grow too large.
FastCDC’s solution:
- Below the average size: Use a stricter mask (more bits must be zero)
- Above the average size: Use a looser mask (fewer bits must be zero)
FastCDC was published in two rounds: a 2016 paper that introduced normalized chunking, and a 2020 revision that added a performance optimization by processing two bytes per iteration. Let’s walk through both versions to see how the dual-mask idea translates into concrete code, and how the algorithm evolved between the two papers.
The 2016 Algorithm
This is the version illustrated above: it processes one byte at a time, shifting and adding through the GEAR table exactly as we saw in the Gear Hash in Action animation, and switches between the strict and loose mask depending on how many bytes of the current chunk it has consumed. Here’s the complete loop:
function find_chunk_boundary(data, min_size, avg_size, max_size):
// Skip to minimum size (cut-point skipping)
position = min_size
hash = 0
// Calculate masks based on average size
bits = log2(avg_size)
mask_s = MASKS[bits + 1] // Stricter (1 more bit)
mask_l = MASKS[bits - 1] // Looser (1 fewer bit)
// Phase 1: Strict mask until average size
while position < avg_size AND position < data.length:
hash = (hash << 1) + GEAR[data[position]]
if (hash & mask_s) == 0:
return position // Found boundary!
position += 1
// Phase 2: Loose mask until maximum size
while position < max_size AND position < data.length:
hash = (hash << 1) + GEAR[data[position]]
if (hash & mask_l) == 0:
return position // Found boundary!
position += 1
// Hit maximum size without finding boundary
return min(max_size, data.length)
// From fastcdc-rs v2016 module
pub fn cut(
source: &[u8],
min_size: usize,
avg_size: usize,
max_size: usize,
mask_s: u64,
mask_l: u64,
) -> usize {
let mut remaining = source.len();
if remaining <= min_size {
return remaining;
}
let mut center = avg_size;
if remaining > max_size {
remaining = max_size;
} else if remaining < center {
center = remaining;
}
let mut index = min_size;
let mut hash: u64 = 0;
// Phase 1: strict mask until center (average size)
while index < center {
hash = (hash << 1).wrapping_add(GEAR[source[index] as usize]);
if (hash & mask_s) == 0 {
return index;
}
index += 1;
}
// Phase 2: loose mask until max_size
while index < remaining {
hash = (hash << 1).wrapping_add(GEAR[source[index] as usize]);
if (hash & mask_l) == 0 {
return index;
}
index += 1;
}
remaining
}
function findChunkBoundary(
data: Uint8Array,
minSize: number,
avgSize: number,
maxSize: number,
): number {
if (data.length <= minSize) {
return data.length;
}
const bits = Math.floor(Math.log2(avgSize));
const maskS = MASKS[bits + 1]; // Stricter
const maskL = MASKS[bits - 1]; // Looser
let remaining = Math.min(data.length, maxSize);
let center = Math.min(avgSize, remaining);
let index = minSize;
let hash = 0n;
// Phase 1: strict mask until center
while (index < center) {
hash = ((hash << 1n) + GEAR[data[index]]) & MASK_64;
if ((hash & maskS) === 0n) {
return index;
}
index++;
}
// Phase 2: loose mask until max
while (index < remaining) {
hash = ((hash << 1n) + GEAR[data[index]]) & MASK_64;
if ((hash & maskL) === 0n) {
return index;
}
index++;
}
return remaining;
}
The 2016 algorithm’s one-byte sliding window is already fast, but the 2020 revision found a way to cut the number of loop iterations in half.
The 2020 Enhancement: Rolling Two Bytes
The 2020 paper introduces a simple optimization: process two adjacent bytes per step as the sliding window moves across the data. Since two consecutive single-bit shifts are equivalent to one two-bit shift, the hash updates for two adjacent bytes can be collapsed:
Instead of two separate steps:
hash = (hash << 1) + GEAR[byte1] // step 1
hash = (hash << 1) + GEAR[byte2] // step 2
Collapse into one:
hash = (hash << 2) + GEAR_LS[byte1] // Left-shifted GEAR table
hash = hash + GEAR[byte2] // Regular GEAR table
The boundary check still happens after each byte, but each check uses a different mask. After the first byte, the hash bits sit one position higher than they would in the 2016 algorithm, so a left-shifted mask (mask_s_ls) is needed to check the equivalent bits. After the second byte, the bits realign with the original positions, so the regular mask (mask_s) is used. The results are identical to processing one byte at a time.
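The equivalence is easy to verify directly (the GEAR table below is an illustrative stand-in, not the real fastcdc-rs constants):

```typescript
const MASK_64 = (1n << 64n) - 1n;

// Stand-in GEAR table: 256 deterministic 64-bit values
const GEAR: bigint[] = Array.from(
  { length: 256 },
  (_, i) => (BigInt(i + 1) * 0x9e3779b97f4a7c15n) & MASK_64,
);
const GEAR_LS: bigint[] = GEAR.map((g) => (g << 1n) & MASK_64);

// 2016 style: two single-bit steps
let h2016 = 0n;
for (const b of [10, 20]) {
  h2016 = ((h2016 << 1n) + GEAR[b]) & MASK_64;
}

// 2020 style: one collapsed two-byte step
let h2020 = ((0n << 2n) + GEAR_LS[10]) & MASK_64;
h2020 = (h2020 + GEAR[20]) & MASK_64;

// After the second byte, the two hashes agree exactly
```

After the first byte only, `h2020` equals `h2016` shifted left by one bit, which is precisely why the intermediate boundary check needs the left-shifted mask.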
Where does the speedup come from? Each time the sliding window advances one step, the CPU must increment the position, evaluate whether the window has reached a size limit, and branch back to process the next position. By processing two bytes per step, this bookkeeping happens half as often. The two single-bit shifts also collapse into one two-bit shift instruction. These savings are small per step, but across millions of bytes they add up.
// Pre-compute left-shifted GEAR table
GEAR_LS[i] = GEAR[i] << 1 // for all i
function find_chunk_boundary_2020(data, min_size, avg_size, max_size):
position = min_size / 2 // Start at half (we process 2 bytes/iter)
hash = 0
// Phase 1: Strict mask, two bytes at a time
while position < avg_size / 2:
byte_pos = position * 2
// First byte: shift by 2, add left-shifted value
hash = (hash << 2) + GEAR_LS[data[byte_pos]]
if (hash & mask_s_ls) == 0:
return byte_pos
// Second byte: add regular value
hash = hash + GEAR[data[byte_pos + 1]]
if (hash & mask_s) == 0:
return byte_pos + 1
position += 1
// Phase 2: Loose mask (similar structure)
// ... same pattern with mask_l
// From fastcdc-rs v2020 module
pub fn cut_gear(
source: &[u8],
min_size: usize,
avg_size: usize,
max_size: usize,
mask_s: u64,
mask_l: u64,
mask_s_ls: u64, // Left-shifted strict mask
mask_l_ls: u64, // Left-shifted loose mask
gear: &[u64; 256],
gear_ls: &[u64; 256], // Left-shifted GEAR table
) -> (u64, usize) {
let mut remaining = source.len();
if remaining <= min_size {
return (0, remaining);
}
let mut center = avg_size;
if remaining > max_size {
remaining = max_size;
} else if remaining < center {
center = remaining;
}
let mut index = min_size / 2;
let mut hash: u64 = 0;
// Phase 1: strict mask, two bytes per iteration
while index < center / 2 {
let a = index * 2;
// First byte
hash = (hash << 2).wrapping_add(gear_ls[source[a] as usize]);
if (hash & mask_s_ls) == 0 {
return (hash, a);
}
// Second byte
hash = hash.wrapping_add(gear[source[a + 1] as usize]);
if (hash & mask_s) == 0 {
return (hash, a + 1);
}
index += 1;
}
// Phase 2: loose mask (same pattern)
while index < remaining / 2 {
let a = index * 2;
hash = (hash << 2).wrapping_add(gear_ls[source[a] as usize]);
if (hash & mask_l_ls) == 0 {
return (hash, a);
}
hash = hash.wrapping_add(gear[source[a + 1] as usize]);
if (hash & mask_l) == 0 {
return (hash, a + 1);
}
index += 1;
}
(hash, remaining)
}
// Pre-computed left-shifted table
const GEAR_LS: bigint[] = GEAR.map(g => (g << 1n) & MASK_64);
function findChunkBoundary2020(
data: Uint8Array,
minSize: number,
avgSize: number,
maxSize: number,
): number {
if (data.length <= minSize) {
return data.length;
}
const bits = Math.floor(Math.log2(avgSize));
const maskS = MASKS[bits + 1];
const maskL = MASKS[bits - 1];
const maskSLs = maskS << 1n; // Left-shifted masks
const maskLLs = maskL << 1n;
let remaining = Math.min(data.length, maxSize);
let center = Math.min(avgSize, remaining);
let index = Math.floor(minSize / 2);
let hash = 0n;
// Phase 1: strict mask, two bytes at a time
while (index < Math.floor(center / 2)) {
const a = index * 2;
// First byte
hash = ((hash << 2n) + GEAR_LS[data[a]]) & MASK_64;
if ((hash & maskSLs) === 0n) return a;
// Second byte
hash = (hash + GEAR[data[a + 1]]) & MASK_64;
if ((hash & maskS) === 0n) return a + 1;
index++;
}
// Phase 2: loose mask
while (index < Math.floor(remaining / 2)) {
const a = index * 2;
hash = ((hash << 2n) + GEAR_LS[data[a]]) & MASK_64;
if ((hash & maskLLs) === 0n) return a;
hash = (hash + GEAR[data[a + 1]]) & MASK_64;
if ((hash & maskL) === 0n) return a + 1;
index++;
}
return remaining;
}
Both versions produce the same chunk boundaries for the same input and parameters. The difference is purely mechanical: the 2020 version reaches those boundaries faster by doing two bytes of work per loop iteration. With the algorithm itself understood, the next question is practical: how do the parameters you choose actually affect the chunks that come out?
Exploring the Parameters
The target average chunk size is the primary parameter when configuring FastCDC. A smaller average means more chunks (better deduplication granularity but more metadata), while a larger average means fewer chunks (less overhead but coarser deduplication). Drag the slider below to see how FastCDC re-chunks the same text at different target sizes:
See how target average size affects chunk boundaries and size distribution.
Notice how the dual-mask strategy keeps chunk sizes clustered around the target average. At small targets (16-32 bytes) you get many chunks, and the distribution chart reveals the normalized shape. At large targets (96-128 bytes) the entire text may collapse into just a few chunks.
To see why normalization matters, consider what a single mask does. Basic CDC checks the same bit pattern from minimum size all the way to maximum size. Each byte after the minimum has an independent $1/2^N \approx 1/\mathit{avgSize}$ probability of triggering a boundary. Because the algorithm cuts at the first match, most chunks end early: the chance of reaching any given byte without a match drops exponentially. This produces a geometric distribution skewed toward small chunks, with a few chunks surviving long enough to reach the maximum size.

The FastCDC paper addresses this with normalization levels (NC1 through NC3), which control how aggressively the two masks differ from the base probability. At NC2 (the paper’s recommended level), the strict mask uses two extra bits, making it 4x harder to trigger than the single-mask baseline and suppressing early cuts below the target average, while the loose mask uses two fewer bits, making it 4x easier to trigger and catching chunks shortly after they pass the average. The result is a tight cluster around the target rather than a skewed spread.
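A sketch of how the normalization level sets the two masks (a hypothetical helper using contiguous low-bit masks like the pseudocode above; the paper spreads the mask bits across the word, but only the number of set bits matters for the probability):

```typescript
// Strict/loose masks for a power-of-two target average:
// NC1 = ±1 bit (2x), NC2 = ±2 bits (4x), NC3 = ±3 bits (8x)
function normalizedMasks(avgSize: number, level: number): [bigint, bigint] {
  const bits = BigInt(Math.log2(avgSize));
  const lvl = BigInt(level);
  const maskS = (1n << (bits + lvl)) - 1n; // harder to hit below the average
  const maskL = (1n << (bits - lvl)) - 1n; // easier to hit above it
  return [maskS, maskL];
}

// NC2 at an 8 KB average: a 15-bit strict mask and an 11-bit loose mask
```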
Compare the two approaches below. Both chunk the same 8 KB of pseudo-random bytes (generated from a fixed seed), using the same Gear hash and the same target parameters. The only difference is how the mask is applied. Random data makes the statistical properties of each algorithm clearly visible because natural language text has too much structure to reveal the distribution shapes at this scale.
The density curve beneath each chunked block view shows the distribution of chunk sizes: the horizontal axis is chunk size in bytes, the vertical axis is how likely a chunk of that size is, and the dashed line marks the target average. A tall, narrow peak means most chunks land near the same size; a long tail trailing to the right means many chunks end up much larger than the target:
Compare how single-mask and dual-mask strategies distribute chunk sizes across the same data.
Basic CDC’s single mask produces chunks that follow an exponential distribution: many small chunks and a long tail of large ones. FastCDC’s dual-mask normalization clusters chunks tightly around the target average, reducing both extremes. This narrower distribution means less wasted metadata on tiny chunks and fewer oversized chunks that dilute deduplication.
FastCDC gives us a chunking algorithm that is fast, produces well-distributed chunk sizes, and, most importantly, generates stable boundaries that survive local edits. But chunking is only the first stage of a deduplication pipeline. Once data is split into chunks, each chunk needs a cryptographic fingerprint, those fingerprints need to be indexed and looked up efficiently, and duplicate chunks need to be eliminated during storage or transmission. The choices made at each stage (hash function, index structure, storage layout) interact with the chunking layer in ways that matter for real-world performance.
In the next post, Part 3: Deduplication in Action, we’ll build on FastCDC to walk through the deduplication pipeline end to end: fingerprinting chunks, detecting duplicates, and examining the cost tradeoffs that shape how these systems perform in practice.