
CDC at Scale on a Budget

Content-Defined Chunking, Part 5 February 26, 2026 | 13 minutes (2454 words)
Part 5 of 5 in a series on Content-Defined Chunking. Previous: Part 4: CDC in the Cloud

Part 4 showed that containers are a prerequisite for CDC on cloud object storage, collapsing per-operation costs by orders of magnitude. But the major providers still charge for every PUT, GET, and byte of egress. Can we do better? A newer generation of S3-compatible services has emerged with pricing models that eliminate or sharply reduce these costs, and caching can cut read expenses further still. This post explores both, then wraps up the series with a look at what motivated this deep dive in the first place.


The Cost Comparison Continued

The dominance of per-operation costs on major cloud providers is what makes container packing essential. The challengers attack exactly the cost dimensions that punish CDC: Cloudflare R2 charges zero egress, Backblaze B2 offers free uploads and storage at a fraction of S3’s price, and Wasabi charges no per-operation fees and no egress fees at all. The explorer below applies the same workload to these challengers.

Challenger Object Storage Provider Cost Explorer
Average Chunk Size: 32 KB
Container Size: 4 MB

The savings over established providers are substantial. Each challenger eliminates a different cost dimension: R2 kills egress, B2 offers free uploads and cheap storage, and Wasabi removes both operations and egress fees entirely. The explorer below puts all six providers side by side so you can compare directly.

Established vs. Challenger Object Storage Provider Cost Comparison
Average Chunk Size: 8 KB
Container Size: 4 MB

Egress dominates the established provider bills, and the challengers that eliminate it see the largest absolute savings. Wasabi’s model is the most aggressive: with no per-operation or egress fees, the only cost is storage itself. However, Wasabi’s pricing comes with constraints. There is a 90-day minimum storage duration (deleting data sooner still incurs the full charge), a 1 TB minimum storage volume, and a fair-use egress policy that caps monthly egress at your total storage volume. For read-heavy workloads where egress significantly exceeds stored data, the “free egress” claim may not hold.

Reducing Costs through Caching

The cost explorers above model a direct path: chunks flow from the writer to storage and from storage to the reader. But production systems rarely work that way. A read-through cache between readers and the storage backend can dramatically reduce both operations costs and egress, the two cost dimensions that dominate at scale.

CDC chunks are unusually well-suited for caching. Every chunk is immutable and content-addressed: its hash is its identity. There is no invalidation problem, because a chunk’s content never changes. If chunk a7f3e9... is in the cache, it will be correct forever. And because deduplication shrinks the working set (many files share the same chunks), the effective cache hit rate is higher than it would be for opaque file-based caching. Popular files that share chunks with other popular files all benefit from the same cached data.
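
To make the no-invalidation property concrete, here is a minimal read-through cache for content-addressed chunks, sketched in Python. The in-memory dict stands in for a real cache tier, and the caller-supplied `backend_get` stands in for a storage GET; both names are illustrative, not any particular API.

```python
import hashlib

def chunk_id(data: bytes) -> str:
    # Content address: a chunk's identity is the hash of its bytes.
    return hashlib.sha256(data).hexdigest()

class ReadThroughCache:
    """Read-through cache for immutable, content-addressed chunks.

    Because a chunk's content never changes, entries never need
    invalidation: a cached chunk is correct forever.
    """
    def __init__(self, backend_get):
        self._backend_get = backend_get  # e.g. a storage GET keyed by chunk hash
        self._store = {}

    def get(self, cid: str) -> bytes:
        if cid in self._store:           # hit: no backend GET, no egress
            return self._store[cid]
        data = self._backend_get(cid)    # miss: one backend read...
        self._store[cid] = data          # ...populates the cache
        return data

# Usage with a stub backend:
backend = {chunk_id(b"hello"): b"hello"}
cache = ReadThroughCache(backend.__getitem__)
cid = chunk_id(b"hello")
assert cache.get(cid) == b"hello"   # miss, fetched from the backend
assert cache.get(cid) == b"hello"   # hit, served from the cache
```

Every hit avoids both a backend GET operation and the egress for that chunk, which is exactly where the two dominant cost dimensions get cut.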

A key question for any cache is how much data must be stored to achieve a given hit rate. The answer depends on the access distribution. Breslau et al. showed that web request frequencies follow a Zipf distribution, where the k-th most popular item is accessed with probability proportional to 1/k^α.[37] The α parameter controls how skewed the popularity curve is. At α = 0, every item is equally popular and caching provides no advantage. As α increases, popularity concentrates: a small number of items account for a disproportionate share of requests, which is exactly the condition where caching thrives. Breslau et al. measured α values between 0.64 and 0.83 for web traffic. More recent measurements by Berger et al. on real CDN and web application traces found α values between 0.85 and 1.0, indicating that modern access patterns are even more skewed.[38]
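
The distribution itself takes only a few lines to compute. This sketch (the function name is ours) returns each item's normalized request share under Zipf for a given α:

```python
def zipf_shares(n_items: int, alpha: float) -> list[float]:
    """Request share of the k-th most popular item, P(k) proportional to 1/k**alpha."""
    weights = [k ** -alpha for k in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# At alpha = 0, shares are uniform; as alpha grows, the head dominates.
uniform = zipf_shares(1000, 0.0)   # every item gets exactly 0.1%
skewed = zipf_shares(1000, 0.8)    # the top item draws several percent alone
```

At α = 0.8, the single most popular of 1000 items draws far more traffic than its 0.1% uniform share, which is why caching just the head of the distribution pays off.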

Dragging the skewness slider in the visualization below shows the effect of α directly. High values concentrate requests on a few very popular items (high skew); low values spread traffic more evenly across all items (low skew).

In the visualization, each bar represents an item (a file or chunk), ranked from most to least frequently requested. The height of each bar is its share of total requests.

Zipf Popularity Distribution
Skewness (α): 0.60

That skewed distribution is exactly why caching works. If you cache only the most popular items, you can serve a disproportionate share of requests without touching the storage backend. Measured α values for web and CDN traffic typically fall between 0.64 and 1.0, but not all workloads follow a Zipf distribution, and yours may differ. Measuring α for a specific workload is feasible but out of scope for this post; see the footnote for pointers on how it’s done. The next visualization shows this relationship directly: given a skewness level and a target hit rate, how much unique data do you actually need to cache?

Cache Size vs. Hit Rate
Skewness (α): 0.60
Target Hit Rate: 50%

Under a Zipf distribution, the cache size needed for a target hit rate h is approximately h^(1/(1−α)) of the total unique data. The explorers below use α = 0.6 (below the measured range, deliberately conservative), giving a cache fraction of h^2.5. This overstates how much cache capacity is needed: with Berger’s higher α values, real caches would require less data for the same hit rate.

The relationship between hit rate and cache size is worth pausing on, because it is not immediately intuitive. A 50% hit rate means serving half of all requests from cache. Because access patterns are skewed, the most popular 18% of unique data accounts for 50% of all requests – those chunks get hit over and over. To reach a 90% hit rate, you need to also cache the moderately popular long tail, which requires about 77% of unique data. And reaching 99% means caching nearly everything (98%), because that last 9% of requests comes from rarely-accessed chunks that each contribute only a small share of traffic.
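­
The approximation is easy to check numerically. A one-function sketch (the name is ours), valid only for α < 1:

```python
def cache_fraction(hit_rate: float, alpha: float) -> float:
    """Approximate fraction of unique data that must be cached to reach
    a target hit rate under Zipf access: h ** (1 / (1 - alpha)).
    Valid for 0 <= alpha < 1."""
    return hit_rate ** (1.0 / (1.0 - alpha))

# With the deliberately conservative alpha = 0.6 (exponent 2.5):
for h in (0.50, 0.90, 0.99):
    print(f"{h:.0%} hit rate -> cache {cache_fraction(h, 0.6):.0%} of unique data")
# -> 18%, 77%, and 98% of unique data, respectively
```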

The cost impact depends heavily on the pricing model. Established cache providers charge for provisioned capacity: you pay for memory whether it is hit or not. Challenger providers charge per-request: you pay only for the operations you use, with no idle cost.

Established Cache Provider Cost Explorer
Cache Hit Rate: 50%
Provisioned Redis (ElastiCache, Memorystore, Azure Cache) charges for memory regardless of hit rate. CDN edges (CloudFront, Cloud CDN, Azure CDN) charge per-request and per-GB delivered. Both reduce origin GET and egress costs, but the break-even hit rate differs sharply between the two models.

The challenger cache providers invert the cost structure. Instead of provisioning memory upfront, you pay for each cache read (hit) and each cache write (miss that populates the cache). Storage costs, if any, scale with the actual cached data volume.

Challenger Cache Provider Cost Explorer
Cache Hit Rate: 50%
Per-request pricing means you pay nothing when the cache is cold and costs scale linearly with usage. Compare the net impact at different hit rates: lower per-read prices (Momento, Workers KV) break even earlier than higher per-read prices (Upstash).
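
The break-even hit rate falls out of a little algebra. In the simplified model below (ours, with illustrative price variables rather than any provider's actual rates), every lookup incurs a cache read, and each miss additionally incurs a cache write plus the full origin GET and egress cost:

```python
def net_saving_per_read(h, origin_get, egress, cache_read, cache_write):
    """Per-read saving vs. no cache at hit rate h (all prices in USD).
    Assumes: every lookup pays cache_read; every miss also pays
    cache_write plus the origin GET and per-chunk egress."""
    no_cache = origin_get + egress
    with_cache = cache_read + (1 - h) * (cache_write + origin_get + egress)
    return no_cache - with_cache

def break_even_hit_rate(origin_get, egress, cache_read, cache_write):
    """Hit rate at which the cache pays for itself (net saving = 0)."""
    return (cache_read + cache_write) / (origin_get + egress + cache_write)

# Illustrative per-chunk numbers: a GET fee plus egress for a 32 KB chunk
# from an established provider, against hypothetical cache op prices.
get_fee = 4e-7
egress_fee = (32 / 1024**2) * 0.09
h_star = break_even_hit_rate(get_fee, egress_fee, 2e-7, 2e-7)
# Because egress dominates the per-read cost, break-even comes at a
# fairly low hit rate; cheaper cache reads push it lower still.
```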

All Costs Considered

The individual explorers above isolate several cost dimensions: storage provider pricing, per-operation and egress fees, cache provider models, hit rates, and their sensitivity to the Zipf access distribution. Six storage providers, nine cache options, chunk size, and container packing create a large enough configuration space that cost-based decisions are difficult to make by intuition alone. The comprehensive model below puts all of these dimensions into a single view.

Comprehensive Cost Model

The cost landscape has a few clear takeaways. First, provider choice dominates: challenger storage providers with free egress and zero per-operation fees can reduce the monthly bill by 90% or more compared to established providers at the same chunk size and container configuration. Second, caching interacts with provider choice in non-obvious ways. A CDN cache in front of an established provider offloads expensive egress and read operations, producing large absolute savings. But the same cache in front of a free-egress provider like R2 or Wasabi adds cost without offsetting much, because there was little egress cost to begin with. Third, container packing remains essential regardless of provider or cache layer. Without it, per-operation costs at small chunk sizes overwhelm every other line item. The container abstraction from Part 4 is not optional; it is a prerequisite for making CDC economically viable on any cloud object store.
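
The comprehensive model ultimately boils down to a handful of multiplications. A stripped-down version with container packing folded in (the function and the price points are illustrative placeholders, not current list prices; check real price sheets before deciding anything):

```python
def monthly_cost(stored_gb, chunk_writes, chunk_reads, egress_gb, prices,
                 chunk_kb=32, container_mb=4):
    """Monthly bill for a CDC workload with container packing.
    Packing divides the billable operation count by chunks-per-container
    (this sketch assumes whole-container reads and writes)."""
    chunks_per_container = (container_mb * 1024) // chunk_kb
    puts = chunk_writes / chunks_per_container
    gets = chunk_reads / chunks_per_container
    return (stored_gb * prices["storage_gb"]
            + puts * prices["per_put"]
            + gets * prices["per_get"]
            + egress_gb * prices["egress_gb"])

# Illustrative price points in USD (placeholders):
established = {"storage_gb": 0.023, "per_put": 5e-6, "per_get": 4e-7, "egress_gb": 0.09}
free_egress = {"storage_gb": 0.015, "per_put": 4.5e-6, "per_get": 3.6e-7, "egress_gb": 0.0}
```

Running both price dicts through the same read-heavy workload makes the first takeaway visible immediately: the egress line item dwarfs everything else on the established side and vanishes entirely on the free-egress side.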

Why I Care About This

This series grew out of my master’s thesis research, where I’m evaluating structure-aware chunking as a deduplication strategy for source code files on large version control platforms. Source code is a particularly interesting domain for chunking because individual files are typically small[27] and edits tend to be localized: small changes concentrated in specific functions or blocks[28]. This means even smaller chunk sizes may be appropriate, since the overhead is bounded by the small file sizes involved.

If edits concentrate in specific functions and blocks, the natural extension of content-defined chunking is to define boundaries using the structure of the source code itself: functions, methods, classes, and modules. Instead of using a rolling hash window to identify chunk boundaries, you can parse the code into its syntactic units and chunk along those boundaries directly. cAST (chunking via abstract syntax tree; Zhang et al., 2025)[14] does exactly this in the context of retrieval-augmented code generation (RAG): it parses source code into an AST, recursively splitting large nodes and merging small siblings to produce chunks that align with function, class, and module boundaries. The result is semantically coherent chunks that improve both retrieval precision and generation quality across diverse programming languages.
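
To make the idea concrete, here is a toy version for Python source built on the standard-library `ast` module. It only splits at top-level function and class boundaries; it is a deliberate simplification of cAST, which parses many languages and recursively splits large nodes and merges small siblings. (One known rough edge of this sketch: decorators above a `def` land in the preceding preamble chunk.)

```python
import ast

def ast_chunks(source: str) -> list[str]:
    """Chunk Python source along top-level def/class boundaries.
    Concatenating the chunks reproduces the original source exactly."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    chunks, prev_end = [], 0
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            if start > prev_end:                   # imports, globals, blank lines
                chunks.append("".join(lines[prev_end:start]))
            chunks.append("".join(lines[start:node.end_lineno]))
            prev_end = node.end_lineno
    if prev_end < len(lines):                      # trailing module-level code
        chunks.append("".join(lines[prev_end:]))
    return chunks
```

An edit inside one function then changes only that function's chunk; every other chunk's hash, and hence its deduplicated storage, is untouched.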

My thesis asks whether aligning chunk boundaries to syntactic structures (functions, classes, and modules) via AST parsing can outperform byte-level CDC for deduplicating source code across versions on large version control platforms. I’m comparing cAST against two baselines: whole-file content-addressable storage, which models Git’s approach without its packfile and delta-compression layers, and FastCDC v2020 from the BSW family of byte-level content-defined chunkers. The comparison spans ten programming languages across hundreds of public repositories with large commit histories, over a range of target chunk sizes. The output is an empirical cost model of the traditional tradeoffs between system resources (CPU and memory), network, and storage, evaluating when the additional overhead of AST parsing adds value, and how much along each cost axis, compared to byte-level chunking and whole-file content-addressable storage.

Conclusion

Every solution at one layer of abstraction creates problems at the next. Content-defined chunking solves fixed-size chunking’s fatal sensitivity to insertions and deletions by letting data determine its own boundaries. But deploying CDC at scale reveals that the chunking algorithm is only the beginning. Chunk size controls a web of interacting costs: storage efficiency, per-operation pricing, network egress, CPU overhead, and memory pressure. Containers decouple logical chunk granularity from physical object count, but at the cost of fragmentation, complex garbage collection, and a design space where container size, rewriting strategy, and GC policy all interact. Caching under Zipf access patterns can dramatically reduce read and egress costs, but the savings depend on provider pricing models that vary by orders of magnitude.

The deeper insight running through this series is that CDC’s power comes from its modularity. The chunking layer and the storage layer are cleanly separated. A chunking algorithm does not need to know whether its output will be stored as individual objects or packed into containers, cached at the edge or served from origin, priced per-operation or per-gigabyte. This separation of concerns is why the same core idea from Rabin’s 1981 fingerprinting still works in a 2025 cloud storage system with container packing, locality-preserved caching, and piggybacked GC-defragmentation. Each layer can evolve independently, and both have.

That modularity also points toward where the field goes next. Structure-aware chunking, like cAST’s use of abstract syntax trees for source code, raises an obvious question: what other domains have exploitable structure? Document formats, configuration files, database snapshots, and serialized protocol buffers all have internal structure that byte-level chunking ignores. On the performance side, VectorCDC’s SIMD acceleration shows that hardware-aware algorithm design can push throughput far beyond what scalar implementations achieve, and as instruction sets widen further, the gap will only grow. Beyond text and code, deduplication for images, video, and other binary formats remains largely unexplored territory where content-aware boundary detection could take entirely different forms.

Perhaps the most consequential open question is what role deduplication plays in the AI revolution ahead. Retrieval-augmented generation systems depend on chunking strategies that balance retrieval precision against chunk coherence. Model checkpointing, distributed training state, and inference caching all generate enormous volumes of partially redundant data. As AI workloads continue to scale, the economics of storing and transferring deduplicated data will only become more critical. The algorithm that lets data decide its own boundaries may have its most important work still ahead.

The α parameter is measured empirically by fitting real access logs to the Zipf model. The procedure is: collect access traces, rank items by frequency (most popular = rank 1), and plot log(rank) vs. log(frequency). If the access pattern follows a Zipf distribution, this log-log plot is approximately linear, and the slope of that line is −α. A steeper slope means more skewed popularity. Breslau et al. and Berger et al. both used this fitting approach on web traffic and CDN traces to arrive at their measured α ranges.
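
The fitting procedure amounts to a few lines of code. A minimal sketch (the function name is ours) using ordinary least squares on the log-log points:

```python
import math
from collections import Counter

def fit_zipf_alpha(access_trace) -> float:
    """Estimate Zipf's alpha from an access trace: rank items by
    frequency, fit log(frequency) vs. log(rank); the slope is -alpha."""
    freqs = sorted(Counter(access_trace).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Sanity check on a synthetic trace whose frequencies follow alpha = 0.8:
trace = [k for k in range(1, 51) for _ in range(round(1000 * k ** -0.8))]
# fit_zipf_alpha(trace) recovers a value close to 0.8
```

On real traces the log-log plot is rarely perfectly linear, so it is worth eyeballing the fit (or its residuals) before trusting the estimated α.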

References

[14]
Y. Zhang, X. Zhao, Z. Z. Wang, C. Yang, J. Wei & T. Wu, "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree," arXiv:2506.15655, 2025.
[27]
I. Herraiz, D. M. German & A. E. Hassan, "On the Distribution of Source Code File Sizes," 6th International Conference on Software and Data Technologies (ICSOFT '11), 2011.
[28]
O. Arafat & D. Riehle, "The Commit Size Distribution of Open Source Software," 42nd Hawaii International Conference on System Sciences (HICSS-42), 2009.
[37]
L. Breslau, P. Cao, L. Fan, G. Phillips & S. Shenker, "Web Caching and Zipf-like Distributions: Evidence and Implications," IEEE INFOCOM '99, 1999.
[38]
D. S. Berger, N. Beckmann & M. Harchol-Balter, "Practical Bounds on Optimal Caching with Variable Object Sizes," Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Vol. 2, No. 2, 2018.