Foundations
M. O. Rabin, "Fingerprinting by Random Polynomials," Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
Introduced polynomial fingerprinting over GF(2) as a rolling hash with provable collision bounds. The foundational technique from which the BSW family of CDC algorithms descends.
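The core idea can be sketched in a few lines: read the byte stream as a polynomial over GF(2) and keep only its remainder modulo a fixed degree-64 polynomial. The modulus and function name below are illustrative assumptions; a real implementation must use a polynomial verified to be irreducible and a table-driven, byte-at-a-time reduction rather than this bit-by-bit loop.

```python
# Illustrative degree-64 modulus (x^64 + x^4 + x^3 + x + 1); chosen for the
# sketch only -- production fingerprints need a verified irreducible polynomial.
POLY = (1 << 64) | 0b11011

def rabin_fingerprint(data: bytes) -> int:
    """Interpret `data` as a polynomial over GF(2) and reduce it mod POLY."""
    f = 0
    for byte in data:
        for bit in range(7, -1, -1):
            f = (f << 1) | ((byte >> bit) & 1)  # shift in the next coefficient
            if f >> 64:                         # degree reached 64: subtract (XOR) the modulus
                f ^= POLY
    return f
```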
A. Muthitacharoen, B. Chen & D. Mazières, "A Low-Bandwidth Network File System," Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), 2001.
First major system to use CDC in practice. LBFS used a 48-byte sliding window with Rabin fingerprints to achieve dramatic bandwidth savings for file synchronization.
J. D. Cohen, "Recursive Hashing Functions for N-Grams," ACM Transactions on Information Systems, vol. 15, no. 3, pp. 291-320, 1997.
Introduced the cyclic polynomial (Buzhash) rolling hash. Replaces Rabin's polynomial division with barrel shifts and XORs, significantly improving throughput while maintaining comparable distribution properties.
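The cyclic-polynomial update can be sketched as follows. The substitution table seed, window size, and function names are assumptions for illustration; the invariant is that one rotate-and-XOR step slides the window by one byte.

```python
import random

random.seed(7)                          # fixed seed: reproducible illustrative table
T = [random.getrandbits(64) for _ in range(256)]
W = 48                                  # window size (LBFS-style chunkers used 48 bytes)
M64 = (1 << 64) - 1

def rotl(x: int, n: int) -> int:
    """Rotate a 64-bit value left by n bits (the 'barrel shift')."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & M64

def buzhash(window: bytes) -> int:
    """Hash of a full window: XOR of table entries, each rotated once per later byte."""
    h = 0
    for b in window:
        h = rotl(h, 1) ^ T[b]
    return h

def roll(h: int, outgoing: int, incoming: int) -> int:
    """Slide one byte: rotate, XOR out the (now over-rotated) old byte, XOR in the new."""
    return rotl(h, 1) ^ rotl(T[outgoing], W) ^ T[incoming]
```

Because every operation is an XOR or rotation, removing the outgoing byte is exact: after W rotations its contribution is `rotl(T[outgoing], W)`, and XORing that value cancels it.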
BSW Family
W. Xia, H. Jiang, D. Feng & L. Tian, "Ddelta: A Deduplication-Inspired Fast Delta Compression Approach," Performance Evaluation, vol. 79, pp. 258-272, 2014.
Introduced the Gear hash, a feedforward rolling hash that eliminates the sliding window entirely. Demonstrated 2x throughput over Rabin by reducing per-byte operations from 7 (Rabin) to 3 (one shift, one add, one table lookup).
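A minimal sketch of a Gear-based chunker, showing the three per-byte operations; the table seed, mask width, and function name are illustrative assumptions, not values from the paper.

```python
import random

random.seed(1)                          # fixed seed: reproducible illustrative table
GEAR = [random.getrandbits(64) for _ in range(256)]

MASK = (1 << 13) - 1                    # 13 zero bits required ~= 8 KiB average chunks

def gear_chunks(data: bytes):
    """Yield chunk lengths found by the Gear rolling hash."""
    h, start = 0, 0
    for i, b in enumerate(data):
        # the entire per-byte work: one shift, one add, one table lookup
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK) == 0:
            yield i + 1 - start
            start, h = i + 1, 0
    if start < len(data):
        yield len(data) - start         # trailing partial chunk
```

The left shift ages old bytes out of the hash on its own (after 64 shifts a byte's contribution is gone), which is why no explicit sliding window is needed.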
W. Xia, H. Jiang, D. Feng, L. Tian, M. Fu & Y. Zhou, "FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication," Proceedings of the USENIX Annual Technical Conference (ATC), 2016.
Combined Gear hashing with normalized chunking (dual-mask strategy) and cut-point skipping. Reported 10x throughput over Rabin-based CDC and 3x over standalone Gear and AE, while matching or exceeding deduplication ratios.
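The dual-mask idea can be sketched as below: a stricter mask (more bits) applies before the target size to discourage small chunks, a looser mask after it to avoid hitting the hard maximum. All constants, mask layouts, and names here are illustrative assumptions; the paper additionally spreads the mask bits rather than using contiguous low bits.

```python
import random

random.seed(3)                          # illustrative Gear table
GEAR = [random.getrandbits(64) for _ in range(256)]

MIN, AVG, MAX = 2048, 8192, 65536       # example size parameters
MASK_S = (1 << 15) - 1                  # stricter mask before the normal point
MASK_L = (1 << 11) - 1                  # looser mask after it

def fastcdc_cut(data: bytes) -> int:
    """Length of the first chunk, normalized-chunking style."""
    n = min(len(data), MAX)
    if n <= MIN:
        return n                        # cut-point skipping: never cut inside MIN
    h = 0
    normal = min(AVG, n)
    for i in range(MIN, normal):        # harder to cut while the chunk is small...
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK_S) == 0:
            return i + 1
    for i in range(normal, n):          # ...easier once past the target size
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK_L) == 0:
            return i + 1
    return n                            # forced cut at MAX
```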
W. Xia, Y. Zhou, H. Jiang, D. Feng, Y. Hua, Y. Hu, Q. Liu & Y. Zhang, "The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 9, pp. 2017-2031, 2020.
Extended FastCDC with a "rolling two bytes per iteration" optimization, adding 30-40% throughput over the 2016 version. Adopted as the default chunker in open-source projects including Rdedup.
C. Zhang, D. Qi, W. Li & J. Guo, "Function of Content Defined Chunking Algorithms in Incremental Synchronization," IEEE Access, vol. 8, pp. 5316-5330, 2020.
Introduced PCI (Popcount Independence), which uses the Hamming weight of a sliding window for boundary detection. Targets incremental synchronization: improved Rsync calculation speed by up to 70% and reduced detected incremental data by 32-57%.
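A popcount-based boundary test can be sketched as follows; the window size and threshold below are illustrative assumptions (the paper tunes both), and the rolling update is the point: one table add and one subtract per byte.

```python
POP = [bin(b).count("1") for b in range(256)]   # per-byte popcount table

def pci_cut(data: bytes, w: int = 5, threshold: int = 23) -> int:
    """First cut: position where the Hamming weight of the trailing w-byte
    window reaches the threshold (w and threshold are illustrative)."""
    if len(data) < w:
        return len(data)
    weight = sum(POP[b] for b in data[:w])
    for i in range(w, len(data)):
        if weight >= threshold:
            return i
        weight += POP[data[i]] - POP[data[i - w]]   # O(1) rolling update
    return len(data)
```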
Local Extrema Family
Y. Zhang, H. Jiang, D. Feng, W. Xia, M. Fu, F. Huang & Y. Zhou, "AE: An Asymmetric Extremum Content Defined Chunking Algorithm for Fast and Bandwidth-Efficient Data Deduplication," Proceedings of IEEE INFOCOM, 2015.
First hashless CDC algorithm. Finds boundaries by locating the maximum byte value in an asymmetric local range. Reported 3x throughput over Rabin-based CDC while achieving comparable deduplication ratios.
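The asymmetric-extremum rule can be sketched as: track the running maximum, and cut once it has gone a full fixed-size window without being exceeded. The window size and function name are illustrative assumptions.

```python
def ae_cut(data: bytes, w: int = 4096) -> int:
    """First AE cut point: the end of a fixed-size window that the
    running maximum byte has survived without being exceeded."""
    max_pos = 0
    for i in range(1, len(data)):
        if data[i] > data[max_pos]:
            max_pos = i                 # new extreme value; its window restarts
        elif i == max_pos + w:
            return i + 1                # max held for w bytes: cut after the window
    return len(data)
```

Note there is no hash state at all: each byte costs one comparison, which is what makes the family amenable to vectorization.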
R. N. Widodo, H. Lim & M. Atiquzzaman, "A New Content-Defined Chunking Algorithm for Data Deduplication in Cloud Storage," Future Generation Computer Systems, vol. 71, pp. 145-156, 2017.
Introduced RAM (Rapid Asymmetric Maximum) with a skip optimization that jumps directly to the actual maximum when the current byte is not the maximum, giving sublinear average-case behavior.
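A sketch of the basic RAM rule, without the skip optimization: take the maximum of a fixed-size window at the start of the chunk, then cut at the first later byte at least that large. Window size and function name are illustrative assumptions.

```python
def ram_cut(data: bytes, w: int = 4096) -> int:
    """Plain RAM (no skip optimization): the first byte at or above the
    fixed window's maximum ends the chunk."""
    if len(data) <= w:
        return len(data)
    m = max(data[:w])                   # maximum of the fixed-size leading window
    for i in range(w, len(data)):
        if data[i] >= m:                # first qualifying byte is the cut point
            return i + 1
    return len(data)
```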
C. Zhang, D. Qi, W. Li & J. Guo, "MII: A Novel Content Defined Chunking Algorithm for Finding Incremental Data in Data Synchronization," IEEE Access, vol. 7, pp. 86862-86875, 2019.
Decouples the context window from chunk size parameters, so boundary candidates remain stable when parameters change. Reduced incremental data by 13-34% compared to other algorithms, targeting backup and data synchronization workloads.
S. Udayashankar, A. Baba & A. Al-Kiswany, "VectorCDC: Accelerating Data Deduplication with Vector Instructions," Proceedings of the 23rd USENIX Conference on File and Storage Technologies (FAST), 2025.
Demonstrated that local-extrema algorithms are inherently SIMD-parallelizable. VRAM (vectorized RAM) achieved 6.5-30 GB/s with AVX-512, a 17x speedup over scalar RAM. Tested across 10 workloads from VM backups to web archives.
Statistical Family
A. S. M. Saeed & L. E. George, "Data Deduplication System Based on Content-Defined Chunking Using Bytes Pair Frequency Occurrence," Symmetry, vol. 12, no. 11, article 1841, 2020.
Two-pass algorithm that builds a digram frequency table and uses the most common byte pairs as chunk boundaries. Reported 10x faster chunking than Rabin-based BSW and 3x faster than TTTD (Two Thresholds, Two Divisors).
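The two-pass idea can be sketched as below; the pair count `k`, minimum spacing, and function name are illustrative assumptions, and the paper's actual boundary selection is more elaborate.

```python
from collections import Counter

def digram_cuts(data: bytes, k: int = 3, min_len: int = 64):
    """Two-pass sketch: pass 1 tallies adjacent byte pairs; pass 2 cuts at
    each occurrence of the k most frequent pairs, with a minimum chunk length."""
    freq = Counter(zip(data, data[1:]))             # pass 1: digram frequencies
    top = {pair for pair, _ in freq.most_common(k)}
    cuts, last = [], 0
    for i in range(1, len(data)):                   # pass 2: place boundaries
        if (data[i - 1], data[i]) in top and i - last >= min_len:
            cuts.append(i)
            last = i
    return cuts
```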
Survey
M. Gregoriadis, L. Balduf, B. Scheuermann & J. Pouwelse, "A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication," arXiv:2409.06066, 2024.
Comprehensive survey organizing CDC algorithms into three families (BSW, Local Extrema, Statistical). Provides the taxonomic framework used in the CDC post series and includes systematic benchmarks across algorithms and datasets.
Deduplication Systems
B. Zhu, K. Li & H. Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," 6th USENIX Conference on File and Storage Technologies (FAST '08), San Jose, CA, February 2008.
Introduced the container abstraction for CDC chunk storage and the Summary Vector index for deduplication lookup without per-chunk disk I/O. The foundation for most modern deduplication storage architectures.
W. Xia, H. Jiang, D. Feng, F. Douglis, P. Shilane, Y. Hua, M. Fu, Y. Zhang & Y. Zhou, "A Comprehensive Study of the Past, Present, and Future of Data Deduplication," Proceedings of the IEEE, vol. 104, no. 9, pp. 1681-1710, September 2016.
Broad survey of deduplication techniques covering chunking, indexing, storage, and optimization. Provides context for how CDC fits into full deduplication system architectures.
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise & P. Camble, "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality," 7th USENIX Conference on File and Storage Technologies (FAST '09), San Jose, CA, February 2009.
Introduced sparse indexing for deduplication at scale by sampling a subset of chunk fingerprints rather than maintaining a complete index. Exploits locality to achieve high deduplication with much lower memory overhead.
D. T. Meyer & W. J. Bolosky, "A Study of Practical Deduplication," 9th USENIX Conference on File and Storage Technologies (FAST '11), San Jose, CA, February 2011.
Large-scale empirical study of deduplication across 857 desktop systems. Showed that cross-user deduplication yields modest gains relative to single-user, and that most deduplication comes from file-level rather than sub-file matches.
H. Wu, C. Wang, K. Lu, Y. Fu & L. Zhu, "One Size Does Not Fit All: The Case for Chunking Configuration in Backup Deduplication," 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '18), 2018.
Studied the impact of chunk size configuration on deduplication effectiveness across different backup workloads. Argues that no single chunk size is optimal for all data types.
Containers, Fragmentation & GC
M. Kaczmarczyk, M. Barczynski, W. Kilian & C. Dubnicki, "Reducing Impact of Data Fragmentation Caused by In-line Deduplication," SYSTOR '12, Haifa, Israel, June 2012.
First systematic study of fragmentation in deduplication systems. Introduced the concept of forward assembly areas and capping to limit the read amplification caused by scattered chunk references across containers.
M. Lillibridge, K. Eshghi & D. Bhagwat, "Improving Restore Speed for Backup Systems that Use Inline Chunk-Based Deduplication," 11th USENIX Conference on File and Storage Technologies (FAST '13), San Jose, CA, February 2013.
Analyzed restore performance degradation as backup generations accumulate. Proposed container capping and read-ahead strategies to improve restore throughput for deduplicated backups.
M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang & Y. Tan, "Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information," USENIX ATC '14, Philadelphia, PA, June 2014.
Used historical backup metadata to predict future access patterns, improving both restore speed and garbage collection efficiency in deduplication systems.
M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang & Y. Tan, "Design Tradeoffs for Data Deduplication Performance in Backup Workloads," 13th USENIX Conference on File and Storage Technologies (FAST '15), Santa Clara, CA, February 2015.
Systematic study of design tradeoffs in deduplication systems: container size, index structure, and rewriting strategies. Showed how each parameter affects deduplication ratio, backup throughput, and restore speed.
F. Douglis, A. Duggal, P. Shilane, T. Wong, S. Yan & F. Botelho, "The Logic of Physical Garbage Collection in Deduplicating Storage," 15th USENIX Conference on File and Storage Technologies (FAST '17), Santa Clara, CA, February 2017.
Detailed analysis of garbage collection in the Data Domain system. Describes the challenges of reclaiming space when chunks are shared across containers and backup generations.
X. Zou et al., "The Dilemma between Deduplication and Locality: Can Both be Achieved?" 19th USENIX Conference on File and Storage Technologies (FAST '21), February 2021.
Directly addresses the fundamental tension between deduplication ratio and data locality. Proposes techniques for improving restore performance without sacrificing deduplication effectiveness.
D. Liu et al., "Garbage Collection Does Not Only Collect Garbage: Piggybacking-Style Defragmentation for Deduplicated Backup Storage," EuroSys '25, 2025.
Combines garbage collection with defragmentation by rewriting live chunks into locality-optimized containers during GC passes. Improves restore speed as a side effect of reclaiming space.
Cloud Storage
A. Duggal, F. Jenkins, P. Shilane, R. Chinthekindi, R. Shah & M. Kamat, "Data Domain Cloud Tier: Backup here, backup there, deduplicated everywhere!" USENIX ATC '19, Renton, WA, July 2019.
Describes the engineering of moving deduplicated data to cloud object storage while preserving deduplication benefits. Addresses the specific challenges of per-operation cloud pricing for small-object workloads.
Caching & Access Patterns
L. Breslau, P. Cao, L. Fan, G. Phillips & S. Shenker, "Web Caching and Zipf-like Distributions: Evidence and Implications," IEEE INFOCOM '99, 1999.
Showed that web request frequencies follow a Zipf distribution. Measured α values between 0.64 and 0.83 for web traffic, establishing the empirical basis for cache sizing models.
D. S. Berger, N. Beckmann & M. Harchol-Balter, "Practical Bounds on Optimal Caching with Variable Object Sizes," Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), vol. 2, no. 2, 2018.
Found α values between 0.85 and 1.0 on real CDN and web application traces, indicating modern access patterns are even more skewed than Breslau et al. measured. Provides practical bounds for cache sizing.
File & Commit Statistics
I. Herraiz, D. M. German & A. E. Hassan, "On the Distribution of Source Code File Sizes," 6th International Conference on Software and Data Technologies (ICSOFT '11), 2011.
Empirical study of source code file size distributions across open-source projects. Provides data used for modeling CDC workloads on source code repositories.
O. Arafat & D. Riehle, "The Commit Size Distribution of Open Source Software," 42nd Hawaii International Conference on System Sciences (HICSS-42), 2009.
Analyzed commit size distributions in open-source projects. Provides empirical grounding for modeling change rates and deduplication effectiveness in version-controlled codebases.
Code-Aware Chunking
Y. Zhang, X. Zhao, Z. Z. Wang, C. Yang, J. Wei & T. Wu, "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree," arXiv:2506.15655, 2025.
Uses abstract syntax trees to chunk source code at structural boundaries rather than content-defined boundaries. Explores the frontier of structure-aware chunking for code retrieval and RAG applications.
Y. Collet, N. Terrell, W. F. Handte, D. Rozenblit, V. Zhang, K. Zhang, Y. Goldschlag, J. Lee, E. Gorokhovsky, Y. Komornik, D. Riegel, S. Angelov & N. Rotem, "OpenZL: A Graph-Based Model for Compression," arXiv:2510.03203, October 2025.
Proposes a graph-based model for compression that can represent and optimize across different compression strategies including deduplication.
Deployments & Tools
N. Koorapati, "Streaming File Synchronization," Dropbox Tech Blog, July 2014.
Describes Dropbox's use of CDC for streaming file synchronization, reducing bandwidth by transmitting only changed chunks.
R. Jain & D. R. Horn, "Broccoli: Syncing Faster by Syncing Less," Dropbox Tech Blog, August 2020.
Describes Dropbox's evolution of their CDC-based sync protocol, reducing sync overhead by intelligently skipping unchanged regions.
Y. Cui, Z. Lai, N. Dai & X. Wang, "QuickSync: Improving Synchronization Efficiency for Mobile Cloud Storage Services," IEEE Transactions on Mobile Computing, vol. 16, no. 12, pp. 3513-3526, 2017.
Applies CDC to mobile cloud storage synchronization, addressing the specific constraints of mobile bandwidth and battery life.
Seafile Ltd., "Data Model," Seafile Administration Manual.
Documents Seafile's CDC implementation for file synchronization and deduplication in their open-source cloud storage platform.
Riot Games Technology, "Supercharging Data Delivery: The New League Patcher," 2019.
Describes Riot Games' move from binary deltas to FastCDC-based content-defined chunking for League of Legends game updates.
IPFS Documentation, "File Systems: Chunking."
Describes IPFS's use of Rabin fingerprinting for content-defined chunking alongside fixed-size splitting in content-addressed distributed storage.
A. Neumann, "Restic Foundation - Content Defined Chunking," Restic Blog, September 2015.
Describes Restic's use of Rabin fingerprint-based CDC for deduplication in its backup tool.
BorgBackup Contributors, "Internals: Data Structures," BorgBackup Documentation.
Describes Borg's Buzhash-based chunker with a keyed seed for chunk boundary detection.
Kopia Contributors, "Architecture," Kopia Documentation.
Describes Kopia's rolling-hash file splitting for content-addressable deduplication in its backup system.
G. Chen, "Variable-size Chunking," Duplicacy Design Document.
Describes Duplicacy's variable-size CDC with configurable minimum, average, and maximum chunk sizes.
A. Chambers, "Technical Overview," Bupstash Documentation.
Describes Bupstash's GearHash-based content-defined chunking for encrypted, deduplicated backups.
C. Percival, "How Tarsnap Deduplication Works," Tarsnap Documentation.
Describes Tarsnap's context-dependent variable-size chunking for client-side deduplication.
L. Poettering, "casync: Content-Addressable Data Synchronization Tool," 2017.
Uses CDC to split filesystem images into variable-size chunks for efficient OS and container image distribution.