Foundations
M. O. Rabin, "Fingerprinting by Random Polynomials," Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
Introduced polynomial fingerprinting over GF(2) as a rolling hash with provable collision bounds. The foundational technique from which the BSW family of CDC algorithms descends.
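The core idea can be sketched in a few lines: read the byte stream as a polynomial over GF(2) and keep only its remainder modulo a fixed degree-64 polynomial. The modulus and function name below are illustrative assumptions; a real implementation must use a polynomial verified to be irreducible and a table-driven, byte-at-a-time reduction rather than this bit-by-bit loop.

```python
# Illustrative degree-64 modulus (x^64 + x^4 + x^3 + x + 1); chosen for the
# sketch only -- production fingerprints need a verified irreducible polynomial.
POLY = (1 << 64) | 0b11011

def rabin_fingerprint(data: bytes) -> int:
    """Interpret `data` as a polynomial over GF(2) and reduce it mod POLY."""
    f = 0
    for byte in data:
        for bit in range(7, -1, -1):
            f = (f << 1) | ((byte >> bit) & 1)  # shift in the next coefficient
            if f >> 64:                         # degree reached 64: subtract (XOR) the modulus
                f ^= POLY
    return f
```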
A. Muthitacharoen, B. Chen & D. Mazières, "A Low-Bandwidth Network File System," Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), 2001.
First major system to use CDC in practice. LBFS used a 48-byte sliding window with Rabin fingerprints to achieve dramatic bandwidth savings for file synchronization.
J. D. Cohen, "Recursive Hashing Functions for N-Grams," ACM Transactions on Information Systems, vol. 15, no. 3, pp. 291-320, 1997.
Introduced the cyclic polynomial (Buzhash) rolling hash. Replaces Rabin's polynomial division with barrel shifts and XORs, significantly improving throughput while maintaining comparable distribution properties.
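The cyclic-polynomial update can be sketched as follows. The substitution table seed, window size, and function names are assumptions for illustration; the invariant is that one rotate-and-XOR step slides the window by one byte.

```python
import random

random.seed(7)                          # fixed seed: reproducible illustrative table
T = [random.getrandbits(64) for _ in range(256)]
W = 48                                  # window size (LBFS-style chunkers used 48 bytes)
M64 = (1 << 64) - 1

def rotl(x: int, n: int) -> int:
    """Rotate a 64-bit value left by n bits (the 'barrel shift')."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & M64

def buzhash(window: bytes) -> int:
    """Hash of a full window: XOR of table entries, each rotated once per later byte."""
    h = 0
    for b in window:
        h = rotl(h, 1) ^ T[b]
    return h

def roll(h: int, outgoing: int, incoming: int) -> int:
    """Slide one byte: rotate, XOR out the (now over-rotated) old byte, XOR in the new."""
    return rotl(h, 1) ^ rotl(T[outgoing], W) ^ T[incoming]
```

Because every operation is an XOR or rotation, removing the outgoing byte is exact: after W rotations its contribution is `rotl(T[outgoing], W)`, and XORing that value cancels it.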
BSW Family
W. Xia, H. Jiang, D. Feng & L. Tian, "Ddelta: A Deduplication-Inspired Fast Delta Compression Approach," Performance Evaluation, vol. 79, pp. 258-272, 2014.
Introduced the Gear hash, a feedforward rolling hash that eliminates the sliding window entirely. Demonstrated 2x throughput over Rabin by reducing per-byte operations from 7 (Rabin) to 3 (one shift, one add, one table lookup).
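A minimal sketch of a Gear-based chunker, showing the three per-byte operations; the table seed, mask width, and function name are illustrative assumptions, not values from the paper.

```python
import random

random.seed(1)                          # fixed seed: reproducible illustrative table
GEAR = [random.getrandbits(64) for _ in range(256)]

MASK = (1 << 13) - 1                    # 13 zero bits required ~= 8 KiB average chunks

def gear_chunks(data: bytes):
    """Yield chunk lengths found by the Gear rolling hash."""
    h, start = 0, 0
    for i, b in enumerate(data):
        # the entire per-byte work: one shift, one add, one table lookup
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK) == 0:
            yield i + 1 - start
            start, h = i + 1, 0
    if start < len(data):
        yield len(data) - start         # trailing partial chunk
```

The left shift ages old bytes out of the hash on its own (after 64 shifts a byte's contribution is gone), which is why no explicit sliding window is needed.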
W. Xia, H. Jiang, D. Feng, L. Tian, M. Fu & Y. Zhou, "FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication," Proceedings of the USENIX Annual Technical Conference (ATC), 2016.
Combined Gear hashing with normalized chunking (dual-mask strategy) and cut-point skipping. Reported 10x throughput over Rabin-based CDC and 3x over standalone Gear and AE, while matching or exceeding deduplication ratios.
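The dual-mask idea can be sketched as below: a stricter mask (more bits) applies before the target size to discourage small chunks, a looser mask after it to avoid hitting the hard maximum. All constants, mask layouts, and names here are illustrative assumptions; the paper additionally spreads the mask bits rather than using contiguous low bits.

```python
import random

random.seed(3)                          # illustrative Gear table
GEAR = [random.getrandbits(64) for _ in range(256)]

MIN, AVG, MAX = 2048, 8192, 65536       # example size parameters
MASK_S = (1 << 15) - 1                  # stricter mask before the normal point
MASK_L = (1 << 11) - 1                  # looser mask after it

def fastcdc_cut(data: bytes) -> int:
    """Length of the first chunk, normalized-chunking style."""
    n = min(len(data), MAX)
    if n <= MIN:
        return n                        # cut-point skipping: never cut inside MIN
    h = 0
    normal = min(AVG, n)
    for i in range(MIN, normal):        # harder to cut while the chunk is small...
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK_S) == 0:
            return i + 1
    for i in range(normal, n):          # ...easier once past the target size
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK_L) == 0:
            return i + 1
    return n                            # forced cut at MAX
```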
W. Xia, Y. Zhou, H. Jiang, D. Feng, Y. Hua, Y. Hu, Q. Liu & Y. Zhang, "The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 9, pp. 2017-2031, 2020.
Extended FastCDC with a "rolling two bytes per iteration" optimization, adding 30-40% throughput over the 2016 version. Adopted as the default chunker in open-source projects including Rdedup.
C. Zhang, D. Qi, W. Li & J. Guo, "Function of Content Defined Chunking Algorithms in Incremental Synchronization," IEEE Access, vol. 8, pp. 5316-5330, 2020.
Introduced PCI (Popcount Independence), which uses the Hamming weight of a sliding window for boundary detection. Targets incremental synchronization: improved Rsync calculation speed by up to 70% and reduced detected incremental data by 32-57%.
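A popcount-based boundary test can be sketched as follows; the window size and threshold below are illustrative assumptions (the paper tunes both), and the rolling update is the point: one table add and one subtract per byte.

```python
POP = [bin(b).count("1") for b in range(256)]   # per-byte popcount table

def pci_cut(data: bytes, w: int = 5, threshold: int = 23) -> int:
    """First cut: position where the Hamming weight of the trailing w-byte
    window reaches the threshold (w and threshold are illustrative)."""
    if len(data) < w:
        return len(data)
    weight = sum(POP[b] for b in data[:w])
    for i in range(w, len(data)):
        if weight >= threshold:
            return i
        weight += POP[data[i]] - POP[data[i - w]]   # O(1) rolling update
    return len(data)
```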
Local Extrema Family
Y. Zhang, H. Jiang, D. Feng, W. Xia, M. Fu, F. Huang & Y. Zhou, "AE: An Asymmetric Extremum Content Defined Chunking Algorithm for Fast and Bandwidth-Efficient Data Deduplication," Proceedings of IEEE INFOCOM, 2015.
First hashless CDC algorithm. Finds boundaries by locating the maximum byte value in an asymmetric local range. Reported 3x throughput over Rabin-based CDC while achieving comparable deduplication ratios.
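The asymmetric-extremum rule can be sketched as: track the running maximum, and cut once it has gone a full fixed-size window without being exceeded. The window size and function name are illustrative assumptions.

```python
def ae_cut(data: bytes, w: int = 4096) -> int:
    """First AE cut point: the end of a fixed-size window that the
    running maximum byte has survived without being exceeded."""
    max_pos = 0
    for i in range(1, len(data)):
        if data[i] > data[max_pos]:
            max_pos = i                 # new extreme value; its window restarts
        elif i == max_pos + w:
            return i + 1                # max held for w bytes: cut after the window
    return len(data)
```

Note there is no hash state at all: each byte costs one comparison, which is what makes the family amenable to vectorization.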
R. N. Widodo, H. Lim & M. Atiquzzaman, "A New Content-Defined Chunking Algorithm for Data Deduplication in Cloud Storage," Future Generation Computer Systems, vol. 71, pp. 145-156, 2017.
Introduced RAM (Rapid Asymmetric Maximum) with a skip optimization that jumps directly to the actual maximum when the current byte is not the maximum, giving sublinear average-case behavior.
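A sketch of the basic RAM rule, without the skip optimization: take the maximum of a fixed-size window at the start of the chunk, then cut at the first later byte at least that large. Window size and function name are illustrative assumptions.

```python
def ram_cut(data: bytes, w: int = 4096) -> int:
    """Plain RAM (no skip optimization): the first byte at or above the
    fixed window's maximum ends the chunk."""
    if len(data) <= w:
        return len(data)
    m = max(data[:w])                   # maximum of the fixed-size leading window
    for i in range(w, len(data)):
        if data[i] >= m:                # first qualifying byte is the cut point
            return i + 1
    return len(data)
```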
C. Zhang, D. Qi, W. Li & J. Guo, "MII: A Novel Content Defined Chunking Algorithm for Finding Incremental Data in Data Synchronization," IEEE Access, vol. 7, pp. 86862-86875, 2019.
Decouples the context window from chunk size parameters, so boundary candidates remain stable when parameters change. Reduced incremental data by 13-34% compared to other algorithms, targeting backup and data synchronization workloads.
S. Udayashankar, A. Baba & A. Al-Kiswany, "VectorCDC: Accelerating Data Deduplication with Vector Instructions," Proceedings of the 23rd USENIX Conference on File and Storage Technologies (FAST), 2025.
Demonstrated that local-extrema algorithms are inherently SIMD-parallelizable. VRAM (vectorized RAM) achieved 6.5-30 GB/s with AVX-512, a 17x speedup over scalar RAM. Tested across 10 workloads from VM backups to web archives.
Statistical Family
A. S. M. Saeed & L. E. George, "Data Deduplication System Based on Content-Defined Chunking Using Bytes Pair Frequency Occurrence," Symmetry, vol. 12, no. 11, article 1841, 2020.
Two-pass algorithm that builds a digram frequency table and uses the most common byte pairs as chunk boundaries. Reported 10x faster chunking than Rabin-based BSW and 3x faster than TTTD (Two Thresholds, Two Divisors).
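The two-pass idea can be sketched as below; the pair count `k`, minimum spacing, and function name are illustrative assumptions, and the paper's actual boundary selection is more elaborate.

```python
from collections import Counter

def digram_cuts(data: bytes, k: int = 3, min_len: int = 64):
    """Two-pass sketch: pass 1 tallies adjacent byte pairs; pass 2 cuts at
    each occurrence of the k most frequent pairs, with a minimum chunk length."""
    freq = Counter(zip(data, data[1:]))             # pass 1: digram frequencies
    top = {pair for pair, _ in freq.most_common(k)}
    cuts, last = [], 0
    for i in range(1, len(data)):                   # pass 2: place boundaries
        if (data[i - 1], data[i]) in top and i - last >= min_len:
            cuts.append(i)
            last = i
    return cuts
```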
Survey
M. Gregoriadis, L. Balduf, B. Scheuermann & J. Pouwelse, "A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication," arXiv:2409.06066, 2024.
Comprehensive survey organizing CDC algorithms into three families (BSW, Local Extrema, Statistical). Provides the taxonomic framework used in the CDC post series and includes systematic benchmarks across algorithms and datasets.
Deduplication Systems
B. Zhu, K. Li & H. Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," 6th USENIX Conference on File and Storage Technologies (FAST '08), San Jose, CA, February 2008.
Introduced the container abstraction for CDC chunk storage and the Summary Vector index for deduplication lookup without per-chunk disk I/O. The foundation for most modern deduplication storage architectures.
W. Xia, H. Jiang, D. Feng, F. Douglis, P. Shilane, Y. Hua, M. Fu, Y. Zhang & Y. Zhou, "A Comprehensive Study of the Past, Present, and Future of Data Deduplication," Proceedings of the IEEE, vol. 104, no. 9, pp. 1681-1710, September 2016.
Broad survey of deduplication techniques covering chunking, indexing, storage, and optimization. Provides context for how CDC fits into full deduplication system architectures.
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise & P. Camble, "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality," 7th USENIX Conference on File and Storage Technologies (FAST '09), San Jose, CA, February 2009.
Introduced sparse indexing for deduplication at scale by sampling a subset of chunk fingerprints rather than maintaining a complete index. Exploits locality to achieve high deduplication with much lower memory overhead.
D. T. Meyer & W. J. Bolosky, "A Study of Practical Deduplication," 9th USENIX Conference on File and Storage Technologies (FAST '11), San Jose, CA, February 2011.
Large-scale empirical study of deduplication across 857 desktop systems. Showed that cross-user deduplication yields modest gains relative to single-user, and that most deduplication comes from file-level rather than sub-file matches.
H. Wu, C. Wang, K. Lu, Y. Fu & L. Zhu, "One Size Does Not Fit All: The Case for Chunking Configuration in Backup Deduplication," 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '18), 2018.
Studied the impact of chunk size configuration on deduplication effectiveness across different backup workloads. Argues that no single chunk size is optimal for all data types.
Containers, Fragmentation & GC
M. Kaczmarczyk, M. Barczynski, W. Kilian & C. Dubnicki, "Reducing Impact of Data Fragmentation Caused by In-line Deduplication," SYSTOR '12, Haifa, Israel, June 2012.
First systematic study of fragmentation in deduplication systems. Introduced the concept of forward assembly areas and capping to limit the read amplification caused by scattered chunk references across containers.
M. Lillibridge, K. Eshghi & D. Bhagwat, "Improving Restore Speed for Backup Systems that Use Inline Chunk-Based Deduplication," 11th USENIX Conference on File and Storage Technologies (FAST '13), San Jose, CA, February 2013.
Analyzed restore performance degradation as backup generations accumulate. Proposed container capping and read-ahead strategies to improve restore throughput for deduplicated backups.
M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang & Y. Tan, "Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information," USENIX ATC '14, Philadelphia, PA, June 2014.
Used historical backup metadata to predict future access patterns, improving both restore speed and garbage collection efficiency in deduplication systems.
M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang & Y. Tan, "Design Tradeoffs for Data Deduplication Performance in Backup Workloads," 13th USENIX Conference on File and Storage Technologies (FAST '15), Santa Clara, CA, February 2015.
Systematic study of design tradeoffs in deduplication systems: container size, index structure, and rewriting strategies. Showed how each parameter affects deduplication ratio, backup throughput, and restore speed.
F. Douglis, A. Duggal, P. Shilane, T. Wong, S. Yan & F. Botelho, "The Logic of Physical Garbage Collection in Deduplicating Storage," 15th USENIX Conference on File and Storage Technologies (FAST '17), Santa Clara, CA, February 2017.
Detailed analysis of garbage collection in the Data Domain system. Describes the challenges of reclaiming space when chunks are shared across containers and backup generations.
X. Zou et al., "The Dilemma between Deduplication and Locality: Can Both be Achieved?" 19th USENIX Conference on File and Storage Technologies (FAST '21), February 2021.
Directly addresses the fundamental tension between deduplication ratio and data locality. Proposes techniques for improving restore performance without sacrificing deduplication effectiveness.
D. Liu et al., "Garbage Collection Does Not Only Collect Garbage: Piggybacking-Style Defragmentation for Deduplicated Backup Storage," EuroSys '25, 2025.
Combines garbage collection with defragmentation by rewriting live chunks into locality-optimized containers during GC passes. Improves restore speed as a side effect of reclaiming space.
Cloud Storage
A. Duggal, F. Jenkins, P. Shilane, R. Chinthekindi, R. Shah & M. Kamat, "Data Domain Cloud Tier: Backup here, backup there, deduplicated everywhere!" USENIX ATC '19, Renton, WA, July 2019.
Describes the engineering of moving deduplicated data to cloud object storage while preserving deduplication benefits. Addresses the specific challenges of per-operation cloud pricing for small-object workloads.
Caching & Access Patterns
L. Breslau, P. Cao, L. Fan, G. Phillips & S. Shenker, "Web Caching and Zipf-like Distributions: Evidence and Implications," IEEE INFOCOM '99, 1999.
Showed that web request frequencies follow a Zipf distribution. Measured α values between 0.64 and 0.83 for web traffic, establishing the empirical basis for cache sizing models.
D. S. Berger, N. Beckmann & M. Harchol-Balter, "Practical Bounds on Optimal Caching with Variable Object Sizes," Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), vol. 2, no. 2, 2018.
Found α values between 0.85 and 1.0 on real CDN and web application traces, indicating modern access patterns are even more skewed than Breslau et al. measured. Provides practical bounds for cache sizing.
File & Commit Statistics
I. Herraiz, D. M. German & A. E. Hassan, "On the Distribution of Source Code File Sizes," 6th International Conference on Software and Data Technologies (ICSOFT '11), 2011.
Empirical study of source code file size distributions across open-source projects. Provides data used for modeling CDC workloads on source code repositories.
O. Arafat & D. Riehle, "The Commit Size Distribution of Open Source Software," 42nd Hawaii International Conference on System Sciences (HICSS-42), 2009.
Analyzed commit size distributions in open-source projects. Provides empirical grounding for modeling change rates and deduplication effectiveness in version-controlled codebases.
Code-Aware Chunking
Y. Zhang, X. Zhao, Z. Z. Wang, C. Yang, J. Wei & T. Wu, "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree," arXiv:2506.15655, 2025.
Uses abstract syntax trees to chunk source code at structural boundaries rather than content-defined boundaries. Explores the frontier of structure-aware chunking for code retrieval and RAG applications.
Y. Collet, N. Terrell, W. F. Handte, D. Rozenblit, V. Zhang, K. Zhang, Y. Goldschlag, J. Lee, E. Gorokhovsky, Y. Komornik, D. Riegel, S. Angelov & N. Rotem, "OpenZL: A Graph-Based Model for Compression," arXiv:2510.03203, October 2025.
Proposes a graph-based model for compression that can represent and optimize across different compression strategies including deduplication.
Deployments & Tools
N. Koorapati, "Streaming File Synchronization," Dropbox Tech Blog, July 2014.
Describes Dropbox's use of CDC for streaming file synchronization, reducing bandwidth by transmitting only changed chunks.
R. Jain & D. R. Horn, "Broccoli: Syncing Faster by Syncing Less," Dropbox Tech Blog, August 2020.
Describes Dropbox's evolution of their CDC-based sync protocol, reducing sync overhead by intelligently skipping unchanged regions.
Y. Cui, Z. Lai, N. Dai & X. Wang, "QuickSync: Improving Synchronization Efficiency for Mobile Cloud Storage Services," IEEE Transactions on Mobile Computing, vol. 16, no. 12, pp. 3513-3526, 2017.
Applies CDC to mobile cloud storage synchronization, addressing the specific constraints of mobile bandwidth and battery life.
Seafile Ltd., "Data Model," Seafile Administration Manual.
Documents Seafile's CDC implementation for file synchronization and deduplication in their open-source cloud storage platform.
Riot Games Technology, "Supercharging Data Delivery: The New League Patcher," 2019.
Describes Riot Games' move from binary deltas to FastCDC-based content-defined chunking for League of Legends game updates.
IPFS Documentation, "File Systems: Chunking."
Describes IPFS's use of Rabin fingerprinting for content-defined chunking alongside fixed-size splitting in content-addressed distributed storage.
A. Neumann, "Restic Foundation - Content Defined Chunking," Restic Blog, September 2015.
Describes Restic's use of Rabin fingerprint-based CDC for deduplication in its backup tool.
BorgBackup Contributors, "Internals: Data Structures," BorgBackup Documentation.
Describes Borg's Buzhash-based chunker with a keyed seed for chunk boundary detection.
Kopia Contributors, "Architecture," Kopia Documentation.
Describes Kopia's rolling-hash file splitting for content-addressable deduplication in its backup system.
G. Chen, "Variable-size Chunking," Duplicacy Design Document.
Describes Duplicacy's variable-size CDC with configurable minimum, average, and maximum chunk sizes.
A. Chambers, "Technical Overview," Bupstash Documentation.
Describes Bupstash's GearHash-based content-defined chunking for encrypted, deduplicated backups.
C. Percival, "How Tarsnap Deduplication Works," Tarsnap Documentation.
Describes Tarsnap's context-dependent variable-size chunking for client-side deduplication.
L. Poettering, "casync: Content-Addressable Data Synchronization Tool," 2017.
Uses CDC to split filesystem images into variable-size chunks for efficient OS and container image distribution.