Chunk Boundary Detection

See how fixed-size chunking and content-defined chunking each handle file modifications.

See in context: Part 1, From Problem to Taxonomy →

Fixed-Size Chunking (48 bytes)

Content-Defined Chunking (sentence boundaries)
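The difference between the two panels can be sketched in a few lines. This is a minimal illustration, not the demo's actual code: the helper names `fixed_chunks`, `sentence_chunks`, and `shared` are hypothetical, the 48-byte size comes from the panel title, and the sentence splitter simply cuts after each period.

```python
import re

def fixed_chunks(data: str, size: int = 48) -> list[str]:
    """Cut every `size` characters, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def sentence_chunks(data: str) -> list[str]:
    """Cut after each period plus trailing whitespace (a content-defined boundary)."""
    return re.findall(r"[^.]*\.\s*|[^.]+$", data)

def shared(old: list[str], new: list[str]) -> int:
    """Count chunks of `new` that already exist in `old`."""
    seen = set(old)
    return sum(chunk in seen for chunk in new)

original = ("The quick brown fox jumps over the lazy dog. "
            "She packed her seven boxes and left. "
            "A warm breeze drifted through the open window.")
edited = "Suddenly, " + original  # insert 10 characters at the front

fixed_shared = shared(fixed_chunks(original), fixed_chunks(edited))
cdc_shared = shared(sentence_chunks(original), sentence_chunks(edited))
print(fixed_shared)  # 0: every fixed-size boundary shifted by 10 characters
print(cdc_shared)    # 2: only the first sentence changed
```

Because the fixed-size boundaries are tied to offsets, a single insertion invalidates every chunk after it. The sentence boundaries move with the content, so only the edited sentence produces a new chunk.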

Gear Hash Rolling Window

Step through the Gear rolling hash byte-by-byte. Hash = (hash << 1) + GEAR[byte].

See in context: Part 2, A Deep Dive into FastCDC →

Gear Hash in Action

The quick brown fox jumps over the lazy dog. She packed her seven boxes and left. A warm breeze drifted through the open window.
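The update rule above, Hash = (hash << 1) + GEAR[byte], can be sketched as follows. The table here is filled from a seeded pseudo-random generator purely for illustration (the seed and the names `gear_step` and `gear_hashes` are assumptions, not the demo's values), and the hash is truncated to 32 bits to match the 32-bit table entries.

```python
import random

_rng = random.Random(42)  # fixed seed (an assumption) so the table is reproducible
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one 32-bit value per byte value

def gear_step(h: int, byte: int) -> int:
    """One rolling step: shift left by one bit, add the table entry, keep 32 bits."""
    return ((h << 1) + GEAR[byte]) & 0xFFFFFFFF

def gear_hashes(data: bytes) -> list[int]:
    """Return the hash value after each byte of `data`."""
    h, out = 0, []
    for byte in data:
        h = gear_step(h, byte)
        out.append(h)
    return out

text = b"The quick brown fox jumps over the lazy dog."
for byte, h in zip(text[:5], gear_hashes(text)):
    print(f"{chr(byte)!r} -> 0x{h:08X}")
```

Each step shifts older contributions one bit further left, so after 32 steps a byte's contribution has left the 32-bit value entirely. The hash therefore depends only on the most recent 32 bytes, which is what makes it a rolling window without any explicit subtraction.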

Gear Lookup Table

Each colored block is one of 256 pre-computed random 32-bit values, keyed by byte. Hover a cell to see its mapping.

Rows 0-1: control bytes · Rows 2-7: printable ASCII · Cell 7F: DEL · Rows 8-F: extended bytes

Rolling Hash Window

The hash rolls forward one byte at a time. When selected bits of the hash match a fixed pattern (in FastCDC, when hash & mask equals 0), a chunk boundary is placed. Target chunk size: min 8, avg 16, max 32 bytes.
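A minimal sketch of boundary detection with the min 8, avg 16, max 32 targets might look like this. Several details are assumptions made for illustration: the seeded table, the function name `cdc_chunks`, the choice to test the top bits of the hash, and deriving the number of mask bits from log2 of the average size.

```python
import random

_rng = random.Random(42)  # illustrative table, not the demo's values
GEAR = [_rng.getrandbits(32) for _ in range(256)]

def cdc_chunks(data: bytes, min_size: int = 8, avg_size: int = 16,
               max_size: int = 32) -> list[bytes]:
    bits = avg_size.bit_length() - 1          # 16 -> 4 mask bits
    mask = ((1 << bits) - 1) << (32 - bits)   # test the top `bits` bits (assumption)
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if length < min_size:
            continue                          # never cut before the minimum size
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])  # boundary: pattern hit or forced cut
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])           # trailing remainder, may be short
    return chunks

text = (b"The quick brown fox jumps over the lazy dog. "
        b"She packed her seven boxes and left. "
        b"A warm breeze drifted through the open window.")
chunks = cdc_chunks(text)
print([len(c) for c in chunks])
```

With 4 mask bits, a random hash satisfies the test about once in 16 positions, which is where the average comes from. The minimum and maximum act as guard rails around that probabilistic cut point.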

Parametric Chunking Explorer

Drag the slider to adjust the target average chunk size and see how FastCDC re-chunks the same text.

See in context: Part 2, A Deep Dive into FastCDC →

See how target average size affects chunk boundaries and size distribution.

Target Average: 88 bytes (min: 44, max: 264)

Chunk Summary

Each bar is one chunk. Height and width show relative size (dashed line = target).

Basic vs Normalized Chunk Size Distribution

Compare how single-mask and dual-mask strategies distribute chunk sizes across the same data.

See in context: Part 2, A Deep Dive into FastCDC →

Target Average: 88 bytes (min: 44, max: 264)

Basic CDC (Single Mask)

Normalized CDC (Dual Mask)

Each bar is one chunk. Height and width show relative size (dashed line = target).

Density curve: higher peaks mean more chunks of that size. Dashed line marks the target average.
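The single-mask versus dual-mask comparison can be approximated with the sketch below. It is a simplified model under stated assumptions: the table is seeded pseudo-random data, the name `chunk_sizes` and the `level` parameter are hypothetical, the masks test the top bits of the hash, and the 88-byte target rounds down to 6 mask bits. With `level=0` both masks are identical (basic CDC); with `level=2` the mask is stricter before the target size and looser after it.

```python
import random

_rng = random.Random(7)  # illustrative table, not the demo's values
GEAR = [_rng.getrandbits(32) for _ in range(256)]

def top_mask(bits: int) -> int:
    """Mask selecting the top `bits` bits of a 32-bit hash."""
    return ((1 << bits) - 1) << (32 - bits)

def chunk_sizes(data: bytes, min_size: int, avg_size: int, max_size: int,
                level: int = 0) -> list[int]:
    bits = avg_size.bit_length() - 1              # 88 -> 6 bits (rounded down)
    strict, loose = top_mask(bits + level), top_mask(bits - level)
    sizes, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if length < min_size:
            continue
        mask = strict if length < avg_size else loose  # switch masks at the target
        if (h & mask) == 0 or length >= max_size:
            sizes.append(length)
            start, h = i + 1, 0
    if start < len(data):
        sizes.append(len(data) - start)
    return sizes

data = bytes(random.Random(1).getrandbits(8) for _ in range(20000))
for name, level in (("basic", 0), ("normalized", 2)):
    sizes = chunk_sizes(data, 44, 88, 264, level)
    print(name, len(sizes), min(sizes), max(sizes), round(sum(sizes) / len(sizes), 1))
```

The stricter mask makes early cuts less likely and the looser mask makes late cuts more likely, which pulls chunk sizes toward the target instead of leaving a long tail.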

Deduplication Explorer

Edit text and save versions to see which chunks are new and which are shared.

See in context: Part 3, Deduplication in Action →

Click "Save Version" after editing to see which chunks are new and which are shared. Hover over chunks to highlight them across views.
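The new-versus-shared classification can be sketched with a content-addressed store. This is an illustration only: the `ChunkStore` class and its `save_version` method are hypothetical names, and SHA-256 is an assumed choice of chunk fingerprint.

```python
import hashlib

class ChunkStore:
    """Content-addressed store: each chunk is keyed by the SHA-256 of its bytes."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def save_version(self, chunks: list[bytes]) -> tuple[int, int]:
        """Store a version's chunks and return (new, shared) counts."""
        new = shared = 0
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest in self.chunks:
                shared += 1                    # already stored by an earlier chunk
            else:
                self.chunks[digest] = chunk    # first time this content is seen
                new += 1
        return new, shared

store = ChunkStore()
v1 = store.save_version([b"intro ", b"body ", b"outro"])
v2 = store.save_version([b"intro ", b"edited body ", b"outro"])
print(v1, v2)  # (3, 0) (1, 2)
```

Only the edited chunk costs new storage in the second version; the two unchanged chunks are recognized by their digests and stored once.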

Cost Tradeoffs Explorer

See how average chunk size affects each cost dimension: CPU, memory, network, and storage.

See in context: Part 3, Deduplication in Action →

Average Chunk Size: 32 KB

Drag the slider to see how average chunk size affects each cost dimension.

Relative Pressure →

These bars show the direction and shape of each tradeoff, not exact magnitudes. CPU and memory costs scale with chunk count (more chunks = more hashing, larger index). Network cost decreases with smaller chunks because the higher deduplication ratio means less unique data to transfer. Storage has a U-shape: very small chunks incur metadata overhead, while very large chunks reduce deduplication and store more redundant data.
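To make the "scales with chunk count" point concrete, the sketch below estimates chunk count and index size for a given average chunk size. The function name `index_footprint` and the 40 bytes per index entry (a 32-byte digest plus an 8-byte location) are assumptions for illustration, and "KB" is read as 1024 bytes.

```python
import math

def index_footprint(total_bytes: int, avg_chunk_bytes: int,
                    entry_bytes: int = 40) -> tuple[int, int]:
    """Return (chunk_count, index_bytes) for a dataset of `total_bytes`."""
    chunk_count = math.ceil(total_bytes / avg_chunk_bytes)
    return chunk_count, chunk_count * entry_bytes

GIB = 1024 ** 3
for avg_kib in (4, 32, 256):
    count, index_bytes = index_footprint(GIB, avg_kib * 1024)
    print(f"{avg_kib:>3} KiB chunks: {count:>7} chunks, index ~{index_bytes / 1024:.0f} KiB")
```

Halving the average chunk size doubles both the number of fingerprints to compute and the number of index entries to hold, which is the pressure the CPU and memory bars represent.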

Established Object Storage Provider Cost Explorer

See how per-operation pricing on established object storage providers affects costs when every chunk is a separate object.

See in context: Part 4, CDC in the Cloud →

Average Chunk Size: 32 KB

Established Object Storage Provider Cost Explorer with Containers

See how container packing reduces API operations costs by bundling chunks into larger objects.

See in context: Part 4, CDC in the Cloud →

Average Chunk Size: 32 KB

Container packing · Container Size: 4 MB

Challenger Object Storage Provider Cost Explorer

Explore costs on challenger object storage providers with radically different pricing models.

See in context: Part 5, CDC at Scale on a Budget →

Average Chunk Size: 32 KB

Container packing · Container Size: 4 MB

Established vs. Challenger Object Storage Provider Cost Comparison

Compare costs across all seven storage providers side by side.

See in context: Part 5, CDC at Scale on a Budget →

Average Chunk Size: 8 KB

Container Size: 4 MB

Zipf Popularity Distribution

Visualize how skewness affects the popularity distribution of items under a Zipf model.

See in context: Part 5, CDC at Scale on a Budget →

Skewness (α): 0.60
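Under a Zipf model, the item at popularity rank k receives a share of requests proportional to 1 / k^α. A minimal sketch, with `zipf_probabilities` as a hypothetical helper name and a finite catalog of `n` items assumed:

```python
def zipf_probabilities(n: int, alpha: float) -> list[float]:
    """Request probability for ranks 1..n, proportional to 1 / rank**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

probs = zipf_probabilities(1000, 0.60)
print(f"rank 1: {probs[0]:.4f}, rank 10: {probs[9]:.4f}, rank 1000: {probs[-1]:.6f}")
```

At α = 0 every item is equally popular; as α grows, more of the traffic concentrates on the top-ranked items.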

Cache Size vs. Hit Rate

Given a skewness level and a target hit rate, how much unique data do you need to cache?

See in context: Part 5, CDC at Scale on a Budget →

Skewness (α): 0.60

Target Hit Rate: 50%
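One way to answer the question is to cache items in popularity order until their cumulative request share reaches the target. The sketch below assumes independent requests, equal-sized items, and a cache that always holds the most popular items; `items_needed` is a hypothetical helper name.

```python
def items_needed(n: int, alpha: float, target_hit_rate: float) -> int:
    """Smallest number of top-ranked items whose combined share meets the target."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n + 1)]
    total = sum(weights)
    cumulative = 0.0
    for rank, weight in enumerate(weights, start=1):
        cumulative += weight / total
        if cumulative >= target_hit_rate - 1e-12:  # tolerance for float rounding
            return rank
    return n

n = 1000
needed = items_needed(n, 0.60, 0.50)
print(f"{needed} of {n} items ({needed / n:.1%}) for a 50% hit rate")
```

With no skew (α = 0), a 50% hit rate requires caching half the catalog. Any positive skew lowers that fraction, because the most popular items account for a disproportionate share of requests.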

Established Cache Provider Cost Explorer

See how established cache providers (ElastiCache, CloudFront) affect origin costs.

See in context: Part 5, CDC at Scale on a Budget →

Cache Hit Rate: 50%

Provisioned Redis (ElastiCache, Memorystore, Azure Cache) charges for memory regardless of hit rate. CDN edges (CloudFront, Cloud CDN, Azure CDN) charge per request and per GB delivered. Both reduce origin GET and egress costs, but the break-even hit rate differs sharply between the two models.

Challenger Cache Provider Cost Explorer

Compare challenger cache providers that scale linearly with per-request pricing.

See in context: Part 5, CDC at Scale on a Budget →

Cache Hit Rate: 50%

Per-request pricing means you pay nothing when the cache is cold and costs scale linearly with usage. Compare the net impact at different hit rates: providers with lower per-read prices (Momento, Workers KV) break even at a lower hit rate than providers with higher per-read prices (Upstash).
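The break-even point for a per-request cache can be sketched with a simple model. It assumes every request pays one cache read and every miss additionally pays the full origin cost (GET plus egress); the function names and all prices below are hypothetical placeholders, not any provider's rates.

```python
def break_even_hit_rate(cache_read_price: float, origin_request_price: float) -> float:
    """Hit rate at which cache spend equals the origin cost it avoids."""
    return cache_read_price / origin_request_price

def net_savings(requests: int, hit_rate: float,
                cache_read_price: float, origin_request_price: float) -> float:
    """Origin-only cost minus cost with the cache in front (positive = cache pays off)."""
    without_cache = requests * origin_request_price
    with_cache = (requests * cache_read_price
                  + requests * (1 - hit_rate) * origin_request_price)
    return without_cache - with_cache

# Hypothetical per-request prices in arbitrary units.
cheap_cache, pricey_cache, origin = 0.1, 0.3, 0.4
print(break_even_hit_rate(cheap_cache, origin))   # lower read price, earlier break-even
print(break_even_hit_rate(pricey_cache, origin))  # higher read price, later break-even
print(net_savings(1_000_000, 0.5, cheap_cache, origin))
```

Under this model the cache saves money only when the hit rate exceeds the ratio of cache read price to origin request price, which is why a cheaper per-read price moves the break-even point to a lower hit rate.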

Comprehensive Cost Model

Combine storage provider, cache layer, chunk size, and container packing into a single cost view.

See in context: Part 5, CDC at Scale on a Budget →

Set hit rate to 0% to see costs without caching, matching the Established vs. Challenger Object Storage Provider Cost Comparison above. The matrix at the bottom shows every storage + cache combination. Green highlights the cheapest pairing; terracotta highlights the most expensive.