<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Rick Winfrey - Writings</title>
    <description>Writings by Rick Winfrey</description>
    <link>https://rickwinfrey.com</link>
    <atom:link href="https://rickwinfrey.com/feed.xml" rel="self" type="application/rss+xml"/>
    <lastBuildDate>Thu, 05 Mar 2026 08:45:21 +0000</lastBuildDate>
    
    <item>
      <title>CDC at Scale on a Budget</title>
      <link>https://rickwinfrey.com/writings/content-defined-chunking-part-5</link>
      <guid isPermaLink="true">https://rickwinfrey.com/writings/content-defined-chunking-part-5</guid>
      <pubDate>Thu, 26 Feb 2026 12:00:00 +0000</pubDate>
      
      <description>Cloud object storage can be expensive for CDC at scale. This post explores cost-saving alternatives: challenger storage providers with radically different pricing, and the role caching plays under Zipf access patterns to drive costs down further.</description>
      
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <![CDATA[<style>
/* ==========================================================================
   CDC Animation Styles
   ========================================================================== */
/* Cloud cost table */
.cost-cloud-section {
  padding: 0 1.25rem 0.5rem;
  border-top: 1px solid rgba(61, 58, 54, 0.1);
  margin-top: 0.5rem;
  padding-top: 1rem;
}

.cost-cloud-header {
  display: flex;
  flex-wrap: wrap;
  justify-content: space-between;
  align-items: baseline;
  gap: 0.5rem;
  margin-bottom: 0.75rem;
}

.cost-cloud-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  font-weight: bold;
  color: #3d3a36;
}

.cost-cloud-workload {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.7rem;
  color: #8b7355;
}

.cost-cloud-table {
  width: 100%;
  border-collapse: collapse;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.75rem;
  margin-bottom: 0.75rem;
}

.cost-cloud-table th {
  font-weight: bold;
  color: #3d3a36;
  text-align: right;
  padding: 0.35rem 0.5rem;
  border-bottom: 2px solid rgba(61, 58, 54, 0.15);
}

.cost-cloud-table th:first-child {
  text-align: left;
}

.cost-cloud-table td {
  padding: 0.35rem 0.5rem;
  text-align: right;
  color: #3d3a36;
  border-bottom: 1px solid rgba(61, 58, 54, 0.07);
}

.cost-cloud-table td:first-child {
  text-align: left;
  color: #8b7355;
}

/* Override global _layout.scss last-child rules */
.cost-cloud-table td:last-child,
.cost-cloud-table th:last-child {
  border-left: 1px solid #cfcfcf;
  font-weight: inherit;
  background-color: inherit;
}

.cost-cloud-table.cost-has-totals tr:last-child td {
  font-weight: bold;
  border-top: 2px solid rgba(61, 58, 54, 0.15);
  border-bottom: none;
}

.cost-cell-value {
  display: block;
}

.cost-cell-calc {
  display: block;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #a08b6e;
  line-height: 1.3;
}

.cost-cloud-assumptions {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.65rem;
  color: #8b7355;
  line-height: 1.5;
}

.cost-cloud-assumptions a {
  color: #c45a3b;
}

.cost-pricing-ref {
  margin: 0.5rem 0 0.75rem;
}
.cost-pricing-ref summary {
  cursor: pointer;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  color: #8b7355;
  list-style: none;
  padding: 0.25rem 0;
}
.cost-pricing-ref summary::-webkit-details-marker { display: none; }
.cost-pricing-ref summary::marker { display: none; }
.cost-pricing-ref summary::before {
  content: "\25B8  ";
  display: inline-block;
  transition: transform 0.2s;
}
.cost-pricing-ref[open] summary::before {
  transform: rotate(90deg);
}
.cost-ref-table td {
  font-weight: normal;
}
.cost-ref-table tr:last-child td {
  border-top: none;
  border-bottom: 1px solid rgba(61, 58, 54, 0.07);
}

/* Container packing toggle */
.container-toggle {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  cursor: pointer;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  color: #3d3a36;
}

.container-toggle input[type="checkbox"] {
  -webkit-appearance: none;
  appearance: none;
  width: 36px;
  height: 20px;
  background: rgba(61, 58, 54, 0.2);
  border-radius: 10px;
  position: relative;
  cursor: pointer;
  transition: background 0.2s ease;
  flex-shrink: 0;
}

.container-toggle input[type="checkbox"]::after {
  content: '';
  position: absolute;
  top: 2px;
  left: 2px;
  width: 16px;
  height: 16px;
  background: #fff;
  border-radius: 50%;
  transition: transform 0.2s ease;
  box-shadow: 0 1px 3px rgba(0,0,0,0.2);
}

.container-toggle input[type="checkbox"]:checked {
  background: #2d7a4f;
}

.container-toggle input[type="checkbox"]:checked::after {
  transform: translateX(16px);
}

/* Savings row highlight */
.container-savings-row td {
  font-weight: bold !important;
  border-top: 2px solid rgba(45, 122, 79, 0.2) !important;
  border-bottom: none !important;
}

/* Provider comparison table (transposed) */
.provider-avg-row td {
  font-weight: bold !important;
  border-top: 2px solid rgba(61, 58, 54, 0.2) !important;
  border-bottom: 2px solid rgba(61, 58, 54, 0.2) !important;
}

.provider-newcomer-row td:last-child .cost-cell-calc {
  color: #2d7a4f;
  font-weight: 600;
}

/* Zipf cache visualization */
.zipf-chart-container {
  padding: 0.5rem 1.25rem 0;
}

.zipf-chart-container canvas {
  width: 100%;
  display: block;
}

.zipf-readout {
  text-align: center;
  padding: 0.75rem 1.25rem;
  font-weight: 600;
  color: #2d7a4f;
  font-size: 0.95rem;
}

/* Footnote */
.cdc-fn-ref {
  text-decoration: none;
  color: #a89b8c;
  font-size: 0.8rem;
}

.cdc-footnote {
  font-size: 0.85rem;
  color: #6b6560;
  border-top: 1px solid rgba(61, 58, 54, 0.15);
  padding-top: 1rem;
  margin-top: 2rem;
  margin-bottom: 2rem;
  line-height: 1.6;
}

/* Comprehensive cost model: cache table + matrix spacing */
.comprehensive-cache-header {
  margin-top: 1.5rem;
  padding-top: 1rem;
  border-top: 1px solid rgba(61, 58, 54, 0.1);
}

#comprehensive-cost-section {
  border-top: none;
  margin-top: 0;
  padding-top: 0;
}

.cost-matrix-scroll {
  overflow-x: auto;
  -webkit-overflow-scrolling: touch;
}

.cost-matrix-table {
  min-width: 72rem;
}

.cost-matrix-table th:first-child,
.cost-matrix-table td:first-child {
  position: sticky;
  left: 0;
  background: #faf9f7;
  z-index: 1;
}

.cost-matrix-cell {
  text-align: center;
  min-width: 6.5rem;
  transition: background-color 0.2s ease, color 0.2s ease;
}

.cost-matrix-cell .cost-cell-value {
  display: block;
}

.cost-matrix-cell .cost-cell-calc {
  display: block;
  white-space: nowrap;
}

/* Styled provider select dropdowns */
.cost-provider-select {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.8rem;
  color: #3d3a36;
  background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.2);
  border-radius: 6px;
  padding: 0.35rem 2rem 0.35rem 0.6rem;
  appearance: none;
  -webkit-appearance: none;
  background-image: url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='10' height='6'%3E%3Cpath d='M0 0l5 6 5-6z' fill='%238b7355'/%3E%3C/svg%3E");
  background-repeat: no-repeat;
  background-position: right 0.6rem center;
  cursor: pointer;
  transition: border-color 0.2s ease;
}

.cost-provider-select:hover {
  border-color: rgba(61, 58, 54, 0.4);
}

.cost-provider-select:focus {
  outline: none;
  border-color: #c45a3b;
  box-shadow: 0 0 0 2px rgba(196, 90, 59, 0.15);
}

.cost-provider-select:disabled {
  opacity: 0.4;
  cursor: not-allowed;
}

@media (max-width: 42em) {
  .cost-cloud-table {
    font-size: 0.65rem;
  }
  .cost-cloud-table th,
  .cost-cloud-table td {
    padding: 0.25rem 0.3rem;
  }
  .cost-cell-calc {
    display: none;
  }
}
</style>

<div class="cdc-series-nav">
Part 5 of 5 in a series on Content-Defined Chunking. Previous: <a href="/writings/content-defined-chunking-part-4">Part 4: CDC in the Cloud</a>
</div>

<p><a href="/writings/content-defined-chunking-part-4">Part 4</a> showed that containers are a prerequisite for CDC on cloud object storage, collapsing per-operation costs by orders of magnitude. But the major providers still charge for every PUT, GET, and byte of egress. Can we do better? A newer generation of S3-compatible services has emerged with pricing models that eliminate or sharply reduce these costs, and caching can cut read expenses further still. This post explores both, then wraps up the series with a look at what motivated this deep dive in the first place.</p>

<hr />

<h3 id="the-cost-comparison-continued">The Cost Comparison Continued</h3>

<p>Per-operation costs dominate on the major cloud providers, which is what makes container packing essential. The challenger services take aim at the very cost dimensions that punish CDC, each in a different way. <a href="https://www.cloudflare.com/developer-platform/products/r2/">Cloudflare R2</a> charges nothing for egress. <a href="https://www.backblaze.com/cloud-storage">Backblaze B2</a> offers free uploads, with storage at a fraction of S3’s price. <a href="https://wasabi.com/">Wasabi</a> charges neither per-operation fees nor egress fees. The explorer below applies the same workload to these challengers.</p>

<div class="cdc-viz" id="newcomer-cost-demo">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Challenger Object Storage Provider Cost Explorer</span>
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Average Chunk Size: <strong id="newcomer-cost-chunk-value">32 KB</strong>
    </span>
    <input type="range" id="newcomer-cost-chunk-slider" min="0" max="100" value="50" step="1" />
  </div>
  <div class="parametric-control-row">
    <label class="container-toggle">
      <input type="checkbox" id="newcomer-cost-packing-toggle" checked="" />
      <span>Container packing</span>
    </label>
    <span class="parametric-control-label">
      Container Size: <strong id="newcomer-cost-container-value">4 MB</strong>
    </span>
    <input type="range" id="newcomer-cost-container-slider" min="1" max="64" value="4" step="1" />
  </div>
  <div class="cost-cloud-section" id="newcomer-cost-cloud-section">
  </div>
</div>

<p>The savings over established providers are substantial. Each challenger eliminates a different cost dimension: R2 kills egress, B2 offers free uploads and cheap storage, and Wasabi removes both operations and egress fees entirely. The explorer below puts all six providers side by side so you can compare directly.</p>

<div class="cdc-viz" id="provider-comparison-demo">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Established vs. Challenger Object Storage Provider Cost Comparison</span>
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Average Chunk Size: <strong id="provider-comparison-chunk-value">8 KB</strong>
    </span>
    <input type="range" id="provider-comparison-chunk-slider" min="0" max="100" value="30" step="1" />
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Container Size: <strong id="provider-comparison-container-value">4 MB</strong>
    </span>
    <input type="range" id="provider-comparison-container-slider" min="1" max="64" value="4" step="1" />
  </div>
  <div class="cost-cloud-section" id="provider-comparison-cloud-section">
  </div>
</div>

<p>Egress dominates the established provider bills, and the challengers that eliminate it see the largest absolute savings. Wasabi’s model is the most aggressive: with no per-operation or egress fees, the only cost is storage itself. However, Wasabi’s pricing comes with constraints. There is a 90-day minimum storage duration (deleting data sooner still incurs the full charge), a 1 TB minimum storage volume, and a fair-use egress policy that caps monthly egress at your total storage volume. For read-heavy workloads where egress significantly exceeds stored data, the “free egress” claim may not hold.</p>
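
<p>The three Wasabi constraints can be written down as simple checks. This is a minimal sketch based only on the constraints as described above, not on Wasabi's billing documentation; the function names and the TB-based units are assumptions made for illustration.</p>

```python
def billable_storage_tb(stored_tb: float, minimum_tb: float = 1.0) -> float:
    """Storage volume that gets billed, given a minimum storage volume."""
    return max(stored_tb, minimum_tb)


def billable_days(days_stored: int, minimum_days: int = 90) -> int:
    """Days billed for an object, given a minimum storage duration."""
    return max(days_stored, minimum_days)


def egress_within_fair_use(stored_tb: float, monthly_egress_tb: float) -> bool:
    """True when monthly egress stays at or below total stored volume."""
    return monthly_egress_tb <= stored_tb


# Hypothetical read-heavy workload: 2 TB stored, 5 TB read per month.
print(billable_storage_tb(0.25))         # small datasets are billed at the minimum
print(billable_days(30))                 # early deletes are billed for the full period
print(egress_within_fair_use(2.0, 5.0))  # egress exceeds storage, so the cap applies
```

<p>Running a workload through checks like these before committing to a provider shows quickly whether the headline pricing applies to it.</p>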

<h3 id="reducing-costs-through-caching">Reducing Costs through Caching</h3>

<p>The cost explorers above model a direct path: chunks flow from the writer to storage and from storage to the reader. But production systems rarely work that way. A read-through cache between readers and the storage backend can dramatically reduce both operations costs and egress, the two cost dimensions that dominate at scale.</p>

<p>CDC chunks are unusually well-suited for caching. Every chunk is immutable and content-addressed: its hash <em>is</em> its identity. There is no invalidation problem, because a chunk’s content never changes. If chunk <code class="language-plaintext highlighter-rouge">a7f3e9...</code> is in the cache, it will be correct forever. And because deduplication shrinks the working set (many files share the same chunks), the effective cache hit rate is higher than it would be for opaque file-based caching. Popular files that share chunks with other popular files all benefit from the same cached data.</p>
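
<p>A minimal sketch of a read-through cache for content-addressed chunks follows. The class name, the use of SHA-256, and the in-memory dictionary standing in for the object store are all assumptions for illustration; a real cache would also bound its size and evict entries.</p>

```python
import hashlib
from typing import Callable, Dict


class ReadThroughChunkCache:
    """Read-through cache keyed by chunk hash (hypothetical helper)."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._chunks: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, chunk_hash: str) -> bytes:
        cached = self._chunks.get(chunk_hash)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        data = self._fetch(chunk_hash)
        # Content addressing lets the cache verify what the origin returned.
        if hashlib.sha256(data).hexdigest() != chunk_hash:
            raise ValueError("origin returned data that does not match its hash")
        # The entry never needs invalidation: the hash pins the content.
        self._chunks[chunk_hash] = data
        return data


# Toy origin: an in-memory dict standing in for the storage backend.
origin = {}
for payload in (b"fn main() {}", b"import os"):
    origin[hashlib.sha256(payload).hexdigest()] = payload

cache = ReadThroughChunkCache(origin.__getitem__)
key = hashlib.sha256(b"import os").hexdigest()
first = cache.get(key)   # miss: fetched from origin, then stored
second = cache.get(key)  # hit: served from the cache
print(cache.hits, cache.misses)
```

<p>Note what is absent: there is no TTL, no version check, and no invalidation path, because the key fully determines the value.</p>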

<p>A key question for any cache is how much data must be stored to achieve a given hit rate. The answer depends on the access distribution. Breslau et al. showed that web request frequencies follow a Zipf distribution, where the <em>k</em>-th most popular item is accessed with probability proportional to 1/<em>k</em><sup><em>α</em></sup>.<span class="cdc-cite"><a href="#ref-37">[37]</a></span> The <em>α</em> parameter controls how skewed the popularity curve is. At <em>α</em> = 0, every item is equally popular and caching provides no advantage. As <em>α</em> increases, popularity concentrates: a small number of items account for a disproportionate share of requests, which is exactly the condition where caching thrives. Breslau et al. measured <em>α</em> values between 0.64 and 0.83 for web traffic. More recent measurements by Berger et al. on real CDN and web application traces found <em>α</em> values between 0.85 and 1.0, indicating that modern access patterns are even more skewed.<span class="cdc-cite"><a href="#ref-38">[38]</a></span></p>
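
<p>To make the effect of <em>α</em> concrete, the sketch below computes Zipf request probabilities for a finite catalog and reports how much traffic the most popular items receive. The catalog size of 10,000 items and the function names are assumptions chosen for illustration.</p>

```python
from typing import List


def zipf_probabilities(n_items: int, alpha: float) -> List[float]:
    """Request probability for ranks 1..n_items, proportional to 1 / k**alpha."""
    weights = [1.0 / (k ** alpha) for k in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_share(n_items: int, alpha: float, top_k: int) -> float:
    """Fraction of all requests that go to the top_k most popular items."""
    return sum(zipf_probabilities(n_items, alpha)[:top_k])


# Hypothetical catalog of 10,000 items: share of requests hitting the top 1%.
for alpha in (0.0, 0.64, 0.83, 1.0):
    share = top_share(10_000, alpha, 100)
    print(f"alpha={alpha:.2f}: top 1% of items receive {share:.1%} of requests")
```

<p>At <em>α</em> = 0 the top 1% of items receive exactly 1% of requests; the share grows as <em>α</em> rises.</p>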

<p>Dragging the skewness slider in the visualization below shows the effect of <em>α</em>. At high <em>α</em> values, a few very popular items receive a large share of requests (high skew); at low <em>α</em> values, traffic is spread more evenly across all items (low skew).</p>

<p>Each bar represents an item, such as a file or chunk, ranked by how frequently it is requested. The height of each bar is that item’s share of total requests.</p>

<div class="cdc-viz" id="zipf-distribution-demo">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Zipf Popularity Distribution</span>
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Skewness (&alpha;): <strong id="zipf-dist-alpha-value">0.60</strong>
    </span>
    <input type="range" id="zipf-dist-alpha-slider" min="0" max="100" value="60" step="1" />
  </div>
  <div class="zipf-chart-container">
    <canvas id="zipf-dist-canvas"></canvas>
  </div>
  <div class="zipf-readout" id="zipf-dist-readout"></div>
</div>

<p>That skewed distribution is exactly why caching works. If you cache only the most popular items, you can serve a disproportionate share of requests without touching the storage backend. Measured <em>α</em> values for web and CDN traffic typically fall between 0.64 and 1.0, but not all workloads follow a Zipf distribution, and yours may differ. Measuring <em>α</em> for a specific workload is feasible but out of scope for this post; see the footnote<sup><a href="#fn-alpha" class="cdc-fn-ref">†</a></sup> for pointers on how it’s done. The next visualization shows this relationship directly: given a skewness level and a target hit rate, how much unique data do you actually need to cache?</p>

<div class="cdc-viz" id="zipf-cache-demo">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Cache Size vs. Hit Rate</span>
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Skewness (&alpha;): <strong id="zipf-cache-alpha-value">0.60</strong>
    </span>
    <input type="range" id="zipf-cache-alpha-slider" min="0" max="100" value="60" step="1" />
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Target Hit Rate: <strong id="zipf-cache-hitrate-value">50%</strong>
    </span>
    <input type="range" id="zipf-cache-hitrate-slider" min="1" max="99" value="50" step="1" />
  </div>
  <div class="zipf-chart-container">
    <canvas id="zipf-cache-canvas"></canvas>
  </div>
  <div class="zipf-readout" id="zipf-cache-readout"></div>
</div>

<p>Under a Zipf distribution with <em>α</em> &lt; 1, the cache size needed for a target hit rate <em>h</em> is approximately <em>h</em><sup>1/(1-<em>α</em>)</sup> of the total unique data. The explorers below use <em>α</em> = 0.6 (below the measured range, deliberately conservative), giving a cache fraction of <em>h</em><sup>2.5</sup>. This overstates how much cache capacity is needed: with Berger’s higher <em>α</em> values, real caches would require less data for the same hit rate.</p>

<p>The relationship between hit rate and cache size is worth pausing on, because it is not immediately intuitive. A 50% hit rate means serving half of all <em>requests</em> from cache. Because access patterns are skewed, the most popular 18% of unique data accounts for 50% of all requests – those chunks get hit over and over. To reach a 90% hit rate, you also need to cache the moderately popular long tail, which requires about 77% of unique data. And reaching 99% means caching nearly everything (98%), because that last 9% of requests comes from rarely accessed chunks that each contribute only a small share of traffic.</p>
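
<p>The percentages above follow directly from the approximation. A short sketch that reproduces them, assuming the same <em>α</em> = 0.6 (the function name is chosen for illustration):</p>

```python
def cache_fraction(hit_rate: float, alpha: float = 0.6) -> float:
    """Approximate fraction of unique data to cache for a target hit rate.

    Uses the h ** (1 / (1 - alpha)) approximation, which only applies
    for 0 <= alpha < 1.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("approximation requires 0 <= alpha < 1")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate ** (1.0 / (1.0 - alpha))


for h in (0.50, 0.90, 0.99):
    print(f"{h:.0%} hit rate -> cache {cache_fraction(h):.0%} of unique data")
# 50% hit rate -> cache 18% of unique data
# 90% hit rate -> cache 77% of unique data
# 99% hit rate -> cache 98% of unique data
```

<p>Passing a larger <code class="language-plaintext highlighter-rouge">alpha</code> shrinks every one of these fractions, which is why 0.6 is the conservative choice.</p>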

<p>The cost impact depends heavily on the pricing model. Established cache providers charge for provisioned capacity: you pay for memory whether it is hit or not. Challenger providers charge per request: you pay only for the operations you use, with no idle cost.</p>

<div class="cdc-viz" id="cache-traditional-demo">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Established Cache Provider Cost Explorer</span>
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Cache Hit Rate: <strong id="cache-traditional-hit-value">50%</strong>
    </span>
    <input type="range" id="cache-traditional-hit-slider" min="0" max="99" value="50" step="1" />
  </div>
  <div class="cost-cloud-section" id="cache-traditional-section">
  </div>
  <div class="cdc-viz-hint">
    Provisioned Redis (ElastiCache, Memorystore, Azure Cache) charges for memory regardless of hit rate. CDN edges (CloudFront, Cloud CDN, Azure CDN) charge per-request and per-GB delivered. Both reduce origin GET and egress costs, but the break-even hit rate differs sharply between the two models.
  </div>
</div>

<p>The challenger cache providers invert the cost structure. Instead of provisioning memory upfront, you pay for each cache read (hit) and each cache write (miss that populates the cache). Storage costs, if any, scale with the actual cached data volume.</p>

<div class="cdc-viz" id="cache-newcomer-demo">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Challenger Cache Provider Cost Explorer</span>
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Cache Hit Rate: <strong id="cache-newcomer-hit-value">50%</strong>
    </span>
    <input type="range" id="cache-newcomer-hit-slider" min="0" max="99" value="50" step="1" />
  </div>
  <div class="cost-cloud-section" id="cache-newcomer-section">
  </div>
  <div class="cdc-viz-hint">
    Per-request pricing means you pay nothing when the cache is cold and costs scale linearly with usage. Compare the net impact at different hit rates: lower per-read prices (Momento, Workers KV) break even earlier than higher per-read prices (Upstash).
  </div>
</div>
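
<p>The two pricing models can be compared with a few lines of arithmetic. This is a rough sketch under stated assumptions, not the model behind the explorers: every price below is a placeholder, the names are hypothetical, and it ignores per-GB delivery fees and cache storage charges.</p>

```python
from dataclasses import dataclass


@dataclass
class ReadWorkload:
    reads_per_month: float
    object_size_gb: float              # size of each object read, in GB
    origin_get_price: float            # $ per GET at the storage backend
    origin_egress_price_per_gb: float  # $ per GB leaving the backend


def origin_cost_avoided(w: ReadWorkload, hit_rate: float) -> float:
    """Origin GET and egress charges that cache hits no longer incur."""
    hits = w.reads_per_month * hit_rate
    per_read = w.origin_get_price + w.object_size_gb * w.origin_egress_price_per_gb
    return hits * per_read


def provisioned_net_cost(w: ReadWorkload, hit_rate: float, monthly_fee: float) -> float:
    """Flat monthly fee minus avoided origin cost (negative means net savings)."""
    return monthly_fee - origin_cost_avoided(w, hit_rate)


def per_request_net_cost(w: ReadWorkload, hit_rate: float,
                         price_per_read: float, price_per_write: float) -> float:
    """Hits pay a cache read; misses pay a cache write to populate the cache."""
    hits = w.reads_per_month * hit_rate
    misses = w.reads_per_month - hits
    cache_bill = hits * price_per_read + misses * price_per_write
    return cache_bill - origin_cost_avoided(w, hit_rate)


# Placeholder numbers only; substitute real prices before drawing conclusions.
workload = ReadWorkload(
    reads_per_month=10_000_000,
    object_size_gb=4 / 1024,           # a 4 MB container
    origin_get_price=0.0000004,
    origin_egress_price_per_gb=0.09,
)
for hit_rate in (0.0, 0.5, 0.9):
    prov = provisioned_net_cost(workload, hit_rate, monthly_fee=100.0)
    per_req = per_request_net_cost(workload, hit_rate, 0.0000005, 0.000005)
    print(f"hit rate {hit_rate:.0%}: provisioned {prov:+.2f}, per-request {per_req:+.2f}")
```

<p>The structural difference shows up at a 0% hit rate: the provisioned model still costs its full monthly fee, while the per-request model costs only the writes that populate the cache.</p>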

<h3 id="all-costs-considered">All Costs Considered</h3>

<p>The individual explorers above isolate several cost dimensions: storage provider pricing, per-operation and egress fees, cache provider models, hit rates, and their sensitivity to the Zipf access distribution. Six storage providers, nine cache options, chunk size, and container packing create a large enough configuration space that cost-based decisions are difficult to make by intuition alone. The comprehensive model below puts all of these dimensions into a single view.</p>

<div class="cdc-viz" id="comprehensive-cost-demo">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Comprehensive Cost Model</span>
  </div>
  <div class="cost-cloud-section" id="comprehensive-cost-section">
  </div>
</div>

<p>The cost landscape has a few clear takeaways. First, provider choice dominates: challenger storage providers with free egress and zero per-operation fees can reduce the monthly bill by 90% or more compared to established providers at the same chunk size and container configuration. Second, caching interacts with provider choice in non-obvious ways. A CDN cache in front of an established provider offloads expensive egress and read operations, producing large absolute savings. But the same cache in front of a free-egress provider like R2 or Wasabi adds cost without offsetting much, because there was little egress cost to begin with. Third, container packing remains essential regardless of provider or cache layer. Without it, per-operation costs at small chunk sizes overwhelm every other line item. The container abstraction from <a href="/writings/content-defined-chunking-part-4">Part 4</a> is not optional; it is a prerequisite for making CDC economically viable on any cloud object store.</p>
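
<p>The third takeaway comes down to object counts. The sketch below counts how many objects a dataset becomes with and without container packing, using the 8 KB chunk and 4 MB container defaults from the comparison explorer. The 1 TiB dataset size is an assumption for illustration, and the function name is hypothetical.</p>

```python
import math
from typing import Optional


def object_count(total_bytes: int, avg_chunk_bytes: int,
                 container_bytes: Optional[int] = None) -> int:
    """Number of stored objects, one per chunk or one per container."""
    chunks = math.ceil(total_bytes / avg_chunk_bytes)
    if container_bytes is None:
        return chunks
    chunks_per_container = max(1, container_bytes // avg_chunk_bytes)
    return math.ceil(chunks / chunks_per_container)


KIB, MIB, TIB = 1024, 1024 ** 2, 1024 ** 4

unpacked = object_count(1 * TIB, 8 * KIB)
packed = object_count(1 * TIB, 8 * KIB, container_bytes=4 * MIB)
print(unpacked)            # 134217728 objects, one PUT each
print(packed)              # 262144 objects
print(unpacked // packed)  # 512x fewer per-object operations
```

<p>Every per-operation fee scales with that object count, so the 512x reduction applies to the PUT bill regardless of which provider sets the price.</p>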

<h3 id="why-i-care-about-this">Why I Care About This</h3>

<p>This series grew out of my master’s thesis research, where I’m evaluating structure-aware chunking as a deduplication strategy for source code files on large version control platforms. Source code is a particularly interesting domain for chunking because individual files are typically small<span class="cdc-cite"><a href="#ref-27">[27]</a></span> and edits tend to be localized: small changes concentrated in specific functions or blocks<span class="cdc-cite"><a href="#ref-28">[28]</a></span>. This means even smaller chunk sizes may be appropriate, since the overhead is bounded by the small file sizes involved.</p>

<p>If edits concentrate in specific functions and blocks, the natural extension of content-defined chunking is to define boundaries using the structure of the source code itself: functions, methods, classes, and modules. Instead of using a rolling hash window to identify chunk boundaries, you can parse the code into its syntactic units and chunk along those boundaries directly. <strong>cAST</strong> (chunking via abstract syntax tree; Zhang et al., 2025)<span class="cdc-cite"><a href="#ref-14">[14]</a></span> does exactly this in the context of retrieval-augmented code generation (RAG): it parses source code into an AST, recursively splitting large nodes and merging small siblings to produce chunks that align with function, class, and module boundaries. The result is semantically coherent chunks, aligned to syntax nodes, that improve both retrieval precision and generation quality across diverse programming languages.</p>
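
<p>To illustrate the basic idea of syntax-aligned boundaries, here is a deliberately simplified sketch using Python's standard <code class="language-plaintext highlighter-rouge">ast</code> module. It is not the cAST algorithm: it only cuts after each top-level statement and does none of the recursive splitting or sibling merging described above. The function name is hypothetical, and the sketch assumes Python 3.8+ (for <code class="language-plaintext highlighter-rouge">end_lineno</code>) and conventional newline characters.</p>

```python
import ast
from typing import List


def chunk_at_top_level_boundaries(source: str) -> List[str]:
    """Split Python source so every chunk ends where a top-level node ends.

    Blank lines and comments between nodes are attached to the following
    chunk, so concatenating the chunks reproduces the source exactly.
    """
    lines = source.splitlines(keepends=True)
    tree = ast.parse(source)
    chunks = []
    start = 0
    for node in tree.body:
        end = node.end_lineno  # 1-based inclusive, so usable as a slice end
        chunks.append("".join(lines[start:end]))
        start = end
    if start < len(lines):
        chunks.append("".join(lines[start:]))  # trailing comments or blank lines
    return chunks


sample = (
    "import os\n"
    "\n"
    "def greet(name):\n"
    "    return 'hi ' + name\n"
    "\n"
    "class Repo:\n"
    "    def path(self):\n"
    "        return os.getcwd()\n"
)
chunks = chunk_at_top_level_boundaries(sample)
print(len(chunks))  # 3: the import, the function, the class
```

<p>An edit inside <code class="language-plaintext highlighter-rouge">greet</code> changes only the second chunk; the import and the class hash to the same values as before.</p>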

<p>My thesis asks whether aligning chunk boundaries to syntactic structures like functions, classes, and modules via AST parsing can outperform byte-level CDC for deduplicating source code across versions on large version control platforms. To understand the benefits cAST offers for source code deduplication, I’m comparing it against <strong>whole-file content-addressable storage</strong> as a baseline, modeling Git’s approach without its packfile and delta compression layers, and <strong>FastCDC v2020</strong> from the BSW family for byte-level content-defined chunking. The comparison spans ten programming languages across hundreds of public repositories with large commit histories, over a range of target chunk sizes, building an empirical cost model for the traditional tradeoffs between system resources (CPU and memory), network, and storage. The cost model should help clarify whether, and when, better deduplication justifies the overhead of parsing and smaller chunks, in both compute and cloud costs. I’m also grounding these costs in two user-facing scenarios, file view and diff view, measuring time to first useful byte (TTFUB) to understand how each deduplication strategy’s overhead translates into latency that users actually feel.</p>

<h3 id="conclusion">Conclusion</h3>

<p>Every solution at one layer of abstraction creates problems at the next. Content-defined chunking solves fixed-size chunking’s fatal sensitivity to insertions and deletions by letting data determine its own boundaries. But deploying CDC at scale reveals that the chunking algorithm is only the beginning. Chunk size controls a web of interacting costs: storage efficiency, per-operation pricing, network egress, CPU overhead, and memory pressure. Containers decouple logical chunk granularity from physical object count, but at the cost of fragmentation, complex garbage collection, and a design space where container size, rewriting strategy, and GC policy all interact. Caching under Zipf access patterns can dramatically reduce read and egress costs, but the savings depend on provider pricing models that vary by orders of magnitude.</p>

<p>The deeper insight running through this series is that CDC’s power comes from its modularity. The chunking layer and the storage layer are cleanly separated. A chunking algorithm does not need to know whether its output will be stored as individual objects or packed into containers, cached at the edge or served from origin, priced per-operation or per-gigabyte. This separation of concerns is why the same core idea from Rabin’s 1981 fingerprinting still works in a 2025 cloud storage system with container packing, locality-preserved caching, and piggybacked GC-defragmentation. Each layer can evolve independently, and both have.</p>

<p>That modularity also points toward where the field goes next. Structure-aware chunking, like cAST’s use of abstract syntax trees for source code, raises an obvious question: what other domains have exploitable structure? Document formats, configuration files, database snapshots, and serialized protocol buffers all have internal structure that byte-level chunking ignores. On the performance side, VectorCDC’s SIMD acceleration shows that hardware-aware algorithm design can push throughput far beyond what scalar implementations achieve, and as instruction sets widen further, the gap will only grow. Beyond text and code, deduplication for images, video, and other binary formats remains largely unexplored territory where content-aware boundary detection could take entirely different forms.</p>

<p>Perhaps the most consequential open question is what role deduplication plays in the AI revolution ahead. Retrieval-augmented generation systems depend on chunking strategies that balance retrieval precision against chunk coherence. Model checkpointing, distributed training state, and inference caching all generate enormous volumes of partially redundant data. As AI workloads continue to scale, the economics of storing and transferring deduplicated data will only become more critical. The algorithm that lets data decide its own boundaries may have its most important work still ahead.</p>

<div class="cdc-footnote" id="fn-alpha">
<sup>†</sup> The <em>&alpha;</em> parameter is measured empirically by fitting real access logs to the Zipf model. The procedure is: collect access traces, rank items by frequency (most popular = rank 1), and plot log(frequency) against log(rank). If the access pattern follows a Zipf distribution, this log-log plot is approximately linear, and the slope of that line is &minus;<em>&alpha;</em>. A steeper slope means more skewed popularity. Breslau et al. and Berger et al. both used this fitting approach on web traffic and CDN traces to arrive at their measured <em>&alpha;</em> ranges.
</div>
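
<p>The fitting procedure in the footnote can be sketched in a few lines. This is a minimal least-squares version under stated assumptions: the function name is hypothetical, the trace is synthetic, and a plain regression over every rank lets the noisy tail influence the slope, so treat the result as a rough estimate.</p>

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def estimate_alpha(accesses: Iterable[Hashable]) -> float:
    """Estimate Zipf alpha as the negated slope of log(frequency) vs. log(rank)."""
    counts = sorted(Counter(accesses).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct items to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Synthetic trace: item k is requested about 10000 / k times (alpha near 1).
trace = [k for k in range(1, 21) for _ in range(round(10000 / k))]
print(f"estimated alpha: {estimate_alpha(trace):.2f}")
```

<p>On a real trace, plot the points before trusting the number: if the log-log plot is visibly curved, the workload is not Zipf and a single <em>&alpha;</em> does not describe it.</p>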

<h3 id="references">References</h3>

<div class="cdc-references">

<div class="bib-entry" id="ref-14">
  <div class="bib-number">[14]</div>
  <div class="bib-citation">Y. Zhang, X. Zhao, Z. Z. Wang, C. Yang, J. Wei &amp; T. Wu, "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree," <em>arXiv:2506.15655</em>, 2025.</div>
  <div class="bib-links">
    <a href="https://arxiv.org/abs/2506.15655" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> arXiv</a>
  </div>
</div>

<div class="bib-entry" id="ref-27">
  <div class="bib-number">[27]</div>
  <div class="bib-citation">I. Herraiz, D. M. German &amp; A. E. Hassan, "On the Distribution of Source Code File Sizes," <em>6th International Conference on Software and Data Technologies (ICSOFT '11)</em>, 2011.</div>
  <div class="bib-links">
    <a href="https://www.researchgate.net/publication/220737991_On_the_Distribution_of_Source_Code_File_Sizes" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> ResearchGate</a>
  </div>
</div>

<div class="bib-entry" id="ref-28">
  <div class="bib-number">[28]</div>
  <div class="bib-citation">O. Arafat &amp; D. Riehle, "The Commit Size Distribution of Open Source Software," <em>42nd Hawaii International Conference on System Sciences (HICSS-42)</em>, 2009.</div>
  <div class="bib-links">
    <a href="https://ieeexplore.ieee.org/document/4755633" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE</a>
  </div>
</div>

<div class="bib-entry" id="ref-37">
  <div class="bib-number">[37]</div>
  <div class="bib-citation">L. Breslau, P. Cao, L. Fan, G. Phillips &amp; S. Shenker, "Web Caching and Zipf-like Distributions: Evidence and Implications," <em>IEEE INFOCOM '99</em>, 1999.</div>
  <div class="bib-links">
    <a href="https://ieeexplore.ieee.org/document/749260" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE</a>
  </div>
</div>

<div class="bib-entry" id="ref-38">
  <div class="bib-number">[38]</div>
  <div class="bib-citation">D. S. Berger, N. Beckmann &amp; M. Harchol-Balter, "Practical Bounds on Optimal Caching with Variable Object Sizes," <em>Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS)</em>, Vol. 2, No. 2, 2018.</div>
  <div class="bib-links">
    <a href="https://dl.acm.org/doi/10.1145/3224427" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> ACM</a>
  </div>
</div>

</div>

<div class="cdc-series-nav">
&larr; <a href="/writings/content-defined-chunking-part-4">Part 4: CDC in the Cloud</a> &middot; Back to <a href="/writings/content-defined-chunking-part-1">Part 1: From Problem to Taxonomy</a>
</div>

<script type="module" src="/assets/js/cdc-animations.js"></script>

]]>
      </content:encoded>
    </item>
    
    <item>
      <title>CDC in the Cloud</title>
      <link>https://rickwinfrey.com/writings/content-defined-chunking-part-4</link>
      <guid isPermaLink="true">https://rickwinfrey.com/writings/content-defined-chunking-part-4</guid>
      <pubDate>Mon, 23 Feb 2026 12:00:00 +0000</pubDate>
      
      <description>CDC chunks are the right logical unit for deduplication, but storing them as individual objects is prohibitively expensive. This post explores containers, the storage abstraction that makes CDC viable at scale, and the fragmentation, garbage collection, and restore challenges they introduce.</description>
      
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <![CDATA[<style>
/* ==========================================================================
   CDC Animation Styles
   ========================================================================== */
/* Cloud cost table */
.cost-cloud-section {
  padding: 0 1.25rem 0.5rem;
  border-top: 1px solid rgba(61, 58, 54, 0.1);
  margin-top: 0.5rem;
  padding-top: 1rem;
}

.cost-cloud-header {
  display: flex;
  flex-wrap: wrap;
  justify-content: space-between;
  align-items: baseline;
  gap: 0.5rem;
  margin-bottom: 0.75rem;
}

.cost-cloud-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  font-weight: bold;
  color: #3d3a36;
}

.cost-cloud-workload {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.7rem;
  color: #8b7355;
}

.cost-cloud-table {
  width: 100%;
  border-collapse: collapse;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.75rem;
  margin-bottom: 0.75rem;
}

.cost-cloud-table th {
  font-weight: bold;
  color: #3d3a36;
  text-align: right;
  padding: 0.35rem 0.5rem;
  border-bottom: 2px solid rgba(61, 58, 54, 0.15);
}

.cost-cloud-table th:first-child {
  text-align: left;
}

.cost-cloud-table td {
  padding: 0.35rem 0.5rem;
  text-align: right;
  color: #3d3a36;
  border-bottom: 1px solid rgba(61, 58, 54, 0.07);
}

.cost-cloud-table td:first-child {
  text-align: left;
  color: #8b7355;
}

.cost-cloud-table tr:last-child td {
  font-weight: bold;
  border-top: 2px solid rgba(61, 58, 54, 0.15);
  border-bottom: none;
}

.cost-cell-value {
  display: block;
}

.cost-cell-calc {
  display: block;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #a08b6e;
  line-height: 1.3;
}

.cost-cloud-assumptions {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.65rem;
  color: #8b7355;
  line-height: 1.5;
}

.cost-cloud-assumptions a {
  color: #c45a3b;
}

.cost-pricing-ref {
  margin: 0.5rem 0 0.75rem;
}
.cost-pricing-ref summary {
  cursor: pointer;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  color: #8b7355;
  list-style: none;
  padding: 0.25rem 0;
}
.cost-pricing-ref summary::-webkit-details-marker { display: none; }
.cost-pricing-ref summary::marker { display: none; }
.cost-pricing-ref summary::before {
  content: "\25B8  ";
  display: inline-block;
  transition: transform 0.2s;
}
.cost-pricing-ref[open] summary::before {
  transform: rotate(90deg);
}
.cost-ref-table tr:last-child td {
  font-weight: normal;
  border-top: none;
  border-bottom: 1px solid rgba(61, 58, 54, 0.07);
}

/* Container packing toggle */
.container-toggle {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  cursor: pointer;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  color: #3d3a36;
}

.container-toggle input[type="checkbox"] {
  -webkit-appearance: none;
  appearance: none;
  width: 36px;
  height: 20px;
  background: rgba(61, 58, 54, 0.2);
  border-radius: 10px;
  position: relative;
  cursor: pointer;
  transition: background 0.2s ease;
  flex-shrink: 0;
}

.container-toggle input[type="checkbox"]::after {
  content: '';
  position: absolute;
  top: 2px;
  left: 2px;
  width: 16px;
  height: 16px;
  background: #fff;
  border-radius: 50%;
  transition: transform 0.2s ease;
  box-shadow: 0 1px 3px rgba(0,0,0,0.2);
}

.container-toggle input[type="checkbox"]:checked {
  background: #2d7a4f;
}

.container-toggle input[type="checkbox"]:checked::after {
  transform: translateX(16px);
}

/* Savings row highlight */
.container-savings-row td {
  font-weight: bold !important;
  border-top: 2px solid rgba(45, 122, 79, 0.2) !important;
  border-bottom: none !important;
}

/* Jazz Cloud single-column table */
.jazz-cost-table {
  max-width: 24rem;
}

/* Comprehensive / single-column cost table */
.comprehensive-cost-table {
  max-width: 28rem;
}

/* Styled provider select dropdowns */
.cost-provider-select {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.8rem;
  color: #3d3a36;
  background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.2);
  border-radius: 6px;
  padding: 0.35rem 2rem 0.35rem 0.6rem;
  appearance: none;
  -webkit-appearance: none;
  background-image: url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='10' height='6'%3E%3Cpath d='M0 0l5 6 5-6z' fill='%238b7355'/%3E%3C/svg%3E");
  background-repeat: no-repeat;
  background-position: right 0.6rem center;
  cursor: pointer;
  transition: border-color 0.2s ease;
}

.cost-provider-select:hover {
  border-color: rgba(61, 58, 54, 0.4);
}

.cost-provider-select:focus {
  outline: none;
  border-color: #c45a3b;
  box-shadow: 0 0 0 2px rgba(196, 90, 59, 0.15);
}

.cost-provider-select:disabled {
  opacity: 0.4;
  cursor: not-allowed;
}

@media (max-width: 42em) {
  .cost-cloud-table {
    font-size: 0.65rem;
  }
  .cost-cloud-table th,
  .cost-cloud-table td {
    padding: 0.25rem 0.3rem;
  }
  .cost-cell-calc {
    display: none;
  }
}
</style>

<div class="cdc-series-nav">
Part 4 of 5 in a series on Content-Defined Chunking. Previous: <a href="/writings/content-defined-chunking-part-3">Part 3: Deduplication in Action</a> &middot; Next: <a href="/writings/content-defined-chunking-part-5">Part 5: CDC at Scale on a Budget</a>
</div>

<p>In <a href="/writings/content-defined-chunking-part-3">Part 3</a>, we built a deduplication pipeline, explored its cost tradeoffs, and saw where CDC is deployed today. The deduplication ratios looked great, but we left a critical question unanswered: what happens when you actually store those billions of variable-size chunks on cloud object storage? Cloud providers charge per API operation, and the answer can be shocking.</p>

<p>This post starts with the cost problem that naive cloud chunk storage creates, then introduces the solution: containers, grouping many small chunks into larger, fixed-size storage objects. But every solution at one layer of abstraction creates problems at the next. Containers introduce fragmentation, make garbage collection hard, and create a design space where container size, rewriting strategy, and GC policy all interact. We’ll reference the decade of research focused on how to best solve these problems along the way.</p>

<h3 id="the-cloud-cost-problem">The Cloud Cost Problem</h3>

<p>Established cloud object storage providers charge per API operation. Every PUT writes one object; every GET reads one object. If you take a textbook CDC pipeline and store each unique chunk as its own object on S3, you get one PUT per chunk on the write path and one GET per chunk on the read path. With billions of small chunks, the per-operation costs add up fast.</p>

<p>The explorer below models exactly this scenario: a naive architecture where every chunk is a separate object. It uses the same workload assumptions as Part 3 (100M users, 1 PB total data, 1B document reads per month, 50 edits per user per month). Drag the chunk size slider to see what happens.</p>

<div class="cdc-viz" id="naive-cost-demo">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Established Object Storage Provider Cost Explorer</span>
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Average Chunk Size: <strong id="naive-cost-chunk-value">32 KB</strong>
    </span>
    <input type="range" id="naive-cost-chunk-slider" min="0" max="100" value="50" step="1" />
  </div>
  <div class="cost-cloud-section" id="naive-cost-cloud-section">
  </div>
</div>

<p>At small chunk sizes, operations cost orders of magnitude more than storage. At 4 KB average chunk size, PUT and GET costs run into the tens of millions of dollars per month across all three providers (S3, GCS, and Azure Blob Storage), while storage itself costs only a few thousand. At 1 KB it gets even worse. Smaller chunks give better deduplication ratios, but the per-operation pricing model punishes you for using them.</p>

<p>This is the fundamental tension: smaller chunks improve deduplication, but cloud object storage charges you per object. Storing each chunk as its own object means the number of API operations scales with the number of chunks, not the number of bytes. The solution is to decouple the two.</p>
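<p>To make the scaling concrete, here is a minimal sketch of the naive request arithmetic. The function name and the two per-request prices are assumptions chosen for illustration, not figures taken from the explorer; substitute your provider’s current price list.</p>

```python
# Hypothetical per-request prices in USD (illustrative assumptions only).
PUT_PRICE_PER_1000 = 0.005
GET_PRICE_PER_1000 = 0.0004


def naive_request_cost(bytes_written: int, bytes_read: int, avg_chunk_bytes: int) -> dict:
    """Request counts and cost when every chunk is its own object."""
    puts = bytes_written // avg_chunk_bytes   # one PUT per chunk written
    gets = bytes_read // avg_chunk_bytes      # one GET per chunk read
    return {
        "puts": puts,
        "gets": gets,
        "put_cost": puts / 1000 * PUT_PRICE_PER_1000,
        "get_cost": gets / 1000 * GET_PRICE_PER_1000,
    }


TB = 10**12
small = naive_request_cost(TB, TB, 4 * 1024)    # 4 KB average chunks
large = naive_request_cost(TB, TB, 64 * 1024)   # 64 KB average chunks
print(small["puts"], large["puts"])
```

<p>The byte volume is identical in both calls; only the chunk size changes, and the request count moves in inverse proportion to it.</p>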

<h3 id="reducing-costs-through-containers">Reducing Costs through Containers</h3>

<p>Most developers know “container” as a Linux, Docker, or OCI concept. In the deduplication literature, the word means something entirely different. Zhu et al. (FAST ’08) defined the container abstraction in the Data Domain paper as a way to decouple chunk granularity from the number of objects written to storage.<span class="cdc-cite"><a href="#ref-16">[16]</a></span> A deduplication container is a self-describing, immutable, fixed-size storage unit, typically a few megabytes, that groups many variable-size chunks together. Instead of storing each chunk as its own object, the system packs hundreds or thousands of chunks into a single container and writes that container as one I/O operation.</p>

<p>A container has two sections. The <strong>metadata section</strong> contains the fingerprint (cryptographic hash) of every chunk in the container, along with each chunk’s byte offset and length within the data section. The <strong>data section</strong> holds the actual chunk bytes, often compressed. The metadata section is compact enough to be read independently, which lets the system know what a container holds without reading the full data. This separation is critical for the optimizations that make container-based deduplication practical.</p>
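<p>The two-section layout can be sketched in a few lines. This is an illustrative format, not the one used by Data Domain or any other product: the 4-byte length prefix, the JSON-encoded metadata, and the function names are all assumptions, and compression of the data section is omitted.</p>

```python
import hashlib
import json
import struct


def pack_container(chunks: list) -> bytes:
    """Serialize as: 4-byte metadata length, metadata section, data section."""
    entries, offset = [], 0
    for chunk in chunks:
        entries.append({
            "fp": hashlib.sha256(chunk).hexdigest(),  # chunk fingerprint
            "offset": offset,                         # offset within the data section
            "length": len(chunk),
        })
        offset += len(chunk)
    metadata = json.dumps(entries).encode("utf-8")
    return struct.pack(">I", len(metadata)) + metadata + b"".join(chunks)


def read_metadata(container: bytes) -> list:
    """Parse only the metadata section; the chunk bytes are never touched."""
    (meta_len,) = struct.unpack(">I", container[:4])
    return json.loads(container[4:4 + meta_len])


def read_chunk(container: bytes, entry: dict) -> bytes:
    """Slice one chunk out of the data section using its metadata entry."""
    (meta_len,) = struct.unpack(">I", container[:4])
    start = 4 + meta_len + entry["offset"]
    return container[start:start + entry["length"]]


container = pack_container([b"alpha", b"bravo!"])
meta = read_metadata(container)
print(read_chunk(container, meta[1]))
```

<p>Because the metadata sits at the front with a known length, a reader can learn which fingerprints a container holds by fetching only the first few kilobytes.</p>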

<p>The <strong>write path</strong> works as follows. Incoming unique chunks, those that pass the deduplication check and are confirmed to be new, are buffered in memory. When enough chunks have accumulated to fill a container, they are packed together (metadata header followed by chunk data), and the entire container is written as a single I/O operation: one PUT on object storage, one sequential write on disk. After the write, the chunk index is updated to map each chunk’s hash to a tuple of (container_id, byte_offset, length). The buffering amortizes the cost of a write across many chunks. At 4 KB average chunk size and 4 MB container size, a single write covers roughly 1,000 chunks.</p>
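<p>A minimal sketch of that write path follows. The class name, the <code>put_object</code> callable, and the key format are hypothetical, and the metadata header is left out to keep the buffering logic visible.</p>

```python
import hashlib


class ContainerWriter:
    """Buffers unique chunks and writes each full buffer as one object."""

    def __init__(self, put_object, container_bytes=4 * 1024 * 1024):
        self.put_object = put_object        # hypothetical callable: (key, data)
        self.container_bytes = container_bytes
        self.index = {}                     # fingerprint -> (container_id, offset, length)
        self.pending = {}                   # fingerprint -> chunk, in arrival order
        self.pending_bytes = 0
        self.next_container_id = 0

    def add_chunk(self, chunk: bytes) -> bool:
        """Return True if the chunk was new and buffered, False if deduplicated."""
        fp = hashlib.sha256(chunk).hexdigest()
        if fp in self.index or fp in self.pending:
            return False
        self.pending[fp] = chunk
        self.pending_bytes += len(chunk)
        if self.pending_bytes >= self.container_bytes:
            self.flush()
        return True

    def flush(self) -> None:
        if not self.pending:
            return
        container_id = self.next_container_id
        self.next_container_id += 1
        key = f"container-{container_id:08d}"
        self.put_object(key, b"".join(self.pending.values()))  # one PUT, many chunks
        offset = 0
        for fp, chunk in self.pending.items():  # index updated only after the write
            self.index[fp] = (container_id, offset, len(chunk))
            offset += len(chunk)
        self.pending = {}
        self.pending_bytes = 0


store = {}
writer = ContainerWriter(store.__setitem__, container_bytes=8)
results = [writer.add_chunk(c) for c in (b"aaaa", b"aaaa", b"bbbb")]
```

<p>Here a dictionary stands in for the object store. Three chunks arrive, one is a duplicate, and the two unique chunks leave as a single write.</p>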

<p>The <strong>read path</strong> inverts this. To retrieve a chunk, the system looks up its container_id, offset, and length in the chunk index, then issues a byte-range request on that container (a range GET on S3, or a positioned read on local disk). If multiple chunks from the same container are needed, a single read can fetch them all. In practice, restoring a file often requires chunks from several containers, so the system issues a set of range reads in parallel.</p>
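<p>One way to plan those reads is to group the needed chunks by container and merge byte ranges that touch, so adjacent chunks cost one request instead of several. The sketch below assumes index offsets are absolute positions within the container object; the function names are hypothetical.</p>

```python
from collections import defaultdict


def plan_range_reads(fingerprints, index):
    """Group needed chunks by container and merge touching byte ranges.

    index maps fingerprint -> (container_id, offset, length).
    Returns {container_id: [(start, end_exclusive), ...]}.
    """
    by_container = defaultdict(list)
    for fp in fingerprints:
        container_id, offset, length = index[fp]
        by_container[container_id].append((offset, offset + length))

    plan = {}
    for container_id, ranges in by_container.items():
        merged = []
        for start, end in sorted(set(ranges)):
            if merged and merged[-1][1] >= start:   # touching or overlapping
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        plan[container_id] = merged
    return plan


def range_header(start: int, end: int) -> str:
    """HTTP Range header value; HTTP byte ranges are inclusive at both ends."""
    return f"bytes={start}-{end - 1}"


index = {"a": (0, 0, 100), "b": (0, 100, 50), "c": (0, 400, 10), "d": (1, 0, 20)}
plan = plan_range_reads(["a", "b", "c", "d", "a"], index)
print(plan)
```

<p>Each range in the plan corresponds to one request, and requests for different containers are independent, so they can be issued concurrently.</p>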

<p>Zhu et al. introduced a key optimization that exploits container structure: <strong>locality-preserved caching</strong>.<span class="cdc-cite"><a href="#ref-16">[16]</a></span> When the system reads a container’s metadata section to check whether it contains a particular chunk, it caches all the fingerprints from that container in memory. Because write streams have locality (consecutive chunks in the input tend to end up in the same or nearby containers), this prefetches fingerprints for upcoming deduplication lookups. Combined with a Bloom filter for quickly skipping chunks that are definitely new (called a “summary vector” in the Data Domain paper), this locality-preserved caching eliminated 99% of on-disk index lookups during deduplication. The Bloom filter handles the common case (most incoming chunks are new and can be skipped without a disk seek), while the container metadata cache handles the remaining case (chunks that might be duplicates are likely co-located with recently seen duplicates).</p>
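<p>The interaction between the two structures can be sketched as follows. This is a toy model rather than the Data Domain implementation: the class names are hypothetical, the Bloom filter spends a whole byte per bit, and the fingerprint cache is an unbounded set where a real system would evict old containers.</p>

```python
import hashlib


class SummaryVector:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.field = bytearray(bits)          # one byte per bit, for clarity

    def _positions(self, fp: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{fp}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, fp: str) -> None:
        for pos in self._positions(fp):
            self.field[pos] = 1

    def might_contain(self, fp: str) -> bool:
        return all(self.field[pos] for pos in self._positions(fp))


class DedupLookup:
    def __init__(self, disk_index, container_fingerprints):
        self.disk_index = disk_index                          # fp -> container_id (slow)
        self.container_fingerprints = container_fingerprints  # container_id -> [fp, ...]
        self.summary = SummaryVector()
        for fp in disk_index:
            self.summary.add(fp)
        self.cache = set()            # fingerprints prefetched from container metadata
        self.disk_lookups = 0

    def is_duplicate(self, fp: str) -> bool:
        if not self.summary.might_contain(fp):
            return False                  # definitely new: no disk access
        if fp in self.cache:
            return True                   # hit from a prefetched container
        self.disk_lookups += 1            # fall back to the on-disk index
        container_id = self.disk_index.get(fp)
        if container_id is None:
            return False                  # Bloom filter false positive
        self.cache.update(self.container_fingerprints[container_id])
        return True


fps = [f"fp{i}" for i in range(8)]
disk_index = {fp: i // 4 for i, fp in enumerate(fps)}     # 4 chunks per container
containers = {0: fps[:4], 1: fps[4:]}
lookup = DedupLookup(disk_index, containers)
hits = [lookup.is_duplicate(fp) for fp in fps]
print(lookup.disk_lookups)
```

<p>Eight duplicate chunks arriving in stream order cost two index lookups, one per container, because each lookup prefetches the fingerprints of its neighbors.</p>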

<p>A note on terminology: the deduplication research literature consistently uses the term “container” for this abstraction (Zhu et al., Lillibridge et al., Xia et al., and others).<span class="cdc-cite"><a href="#ref-16">[16]</a></span><span class="cdc-cite"><a href="#ref-18">[18]</a></span><span class="cdc-cite"><a href="#ref-15">[15]</a></span> Tools such as the backup program Restic and the version control system Git use “packfile” for a similar concept. Git’s packfile format serves a somewhat different purpose (it includes delta compression between objects, not just packing), so this post uses the literature’s term to avoid conflation.</p>

<p>The key insight of the container abstraction is that it <strong>decouples logical chunk granularity from physical object count</strong>. You can have billions of 4 KB chunks but only millions of 4 MB containers. The chunk index grows with the number of unique chunks, but the storage layer deals with far fewer, larger objects. The CDC algorithm does not need to change. The same Gear hash or Rabin fingerprint that produces variable-size chunks feeds into the same deduplication lookup. The container is purely a storage-layer concern, invisible to the chunking logic above it.</p>

<h3 id="the-impact-of-containers-on-cost">The Impact of Containers on Cost</h3>

<p>Now that we understand the container abstraction, let’s revisit the cost picture. The explorer below shows totals both with and without containers. Compare them to see the magnitude of cost savings that containers provide.</p>

<div class="cdc-viz" id="packed-cost-demo">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Established Object Storage Provider Cost Explorer with Containers</span>
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Average Chunk Size: <strong id="packed-cost-chunk-value">32 KB</strong>
    </span>
    <input type="range" id="packed-cost-chunk-slider" min="0" max="100" value="50" step="1" />
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Container Size: <strong id="packed-cost-container-value">4 MB</strong>
    </span>
    <input type="range" id="packed-cost-container-slider" min="1" max="64" value="4" step="1" />
  </div>
  <div class="cost-cloud-section" id="packed-cost-cloud-section">
  </div>
</div>

<p>The savings are dramatic. At 4 KB average chunk size with 4 MB containers, each container holds roughly 1,000 chunks. That means PUT operations drop by a factor of 1,000, and GET operations drop similarly (assuming reasonable locality, where chunks needed for a read tend to cluster in the same containers). Storage cost is identical in both modes because the same bytes are stored either way. But operations cost goes from dominating the monthly bill to being negligible.</p>
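<p>The reduction factor and the locality caveat can both be checked with a few lines of arithmetic. The helper names below are hypothetical, and the model assumes one GET per distinct container touched during a restore.</p>

```python
import math

KB, MB = 1024, 1024 * 1024


def chunks_per_container(avg_chunk_bytes: int, container_bytes: int) -> int:
    return max(1, container_bytes // avg_chunk_bytes)


def puts_for_write(unique_chunks: int, avg_chunk_bytes: int, container_bytes: int) -> int:
    """PUTs needed when chunks are packed rather than stored one per object."""
    per_container = chunks_per_container(avg_chunk_bytes, container_bytes)
    return math.ceil(unique_chunks / per_container)


def gets_for_restore(container_ids) -> int:
    """One GET per distinct container the restore has to touch."""
    return len(set(container_ids))


per_container = chunks_per_container(4 * KB, 4 * MB)        # 1024
packed_puts = puts_for_write(1_000_000, 4 * KB, 4 * MB)     # 977 instead of 1,000,000

# Restoring 2,048 chunks: perfectly clustered versus one chunk per container.
clustered = gets_for_restore(i // per_container for i in range(2048))   # 2
scattered = gets_for_restore(range(2048))                               # 2048
```

<p>The PUT reduction is guaranteed by packing. The GET reduction is only as good as the layout: in the scattered case, containers save nothing on the read path.</p>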

<p>This is why container packing is not an optimization. It is a <strong>prerequisite</strong> for running CDC-based deduplication on cloud object storage at any meaningful scale. Without it, the per-operation pricing model of S3, GCS, and Azure Blob Storage makes fine-grained chunking economically impossible. With it, the system can use whatever chunk size gives the best deduplication ratio, because the storage layer absorbs the object-count explosion transparently.</p>

<h3 id="more-containers-more-problems">More Containers More Problems</h3>

<p>Containers solve the operations cost problem, but they create three new problems at the next layer of abstraction. First, <strong>fragmentation</strong>: because deduplication shares chunks across versions, the chunks needed to reconstruct any single version become scattered across containers written at different times, degrading read performance. Second, <strong>garbage collection</strong>: deleting old versions only removes references to chunks, but a container can only be freed when every chunk in it is unreferenced, so dead space accumulates. Third, <strong>design tradeoffs</strong>: container size, chunk size, rewriting aggressiveness, and GC policy all interact in ways that make tuning surprisingly difficult. Each of these has been the subject of over a decade of focused research.</p>

<h3 id="the-fragmentation-problem">The Fragmentation Problem</h3>

<p>The setup is straightforward. Deduplication works by sharing chunks across versions. If a chunk from the current version already exists in the store, the new version simply references it rather than storing it again. This sharing is the entire point of deduplication. But it means the chunks needed to reconstruct the latest version of a file are physically scattered across containers that were written at different times, alongside chunks from unrelated files and earlier versions.</p>

<p>This fragmentation worsens predictably over time. When you write a file for the first time, all of its chunks are new, so they go into containers sequentially with perfect locality. Reading that file back means reading containers in order, each one packed with relevant data. The second version deduplicates most of its chunks against the first, so very few new containers are written. The chunks that are new land in fresh containers alongside other new chunks from that write cycle. So far, so good. But by the hundredth version, the latest version’s chunks are scattered across containers written during every prior cycle. Some containers hold mostly chunks from version 1. Others hold chunks from version 47. Reconstructing the current version now requires reading dozens or hundreds of containers, each containing mostly irrelevant chunks that happen to share a container with one or two chunks you need. This is <strong>read amplification</strong>: to get the bytes you want, you must read far more bytes than you need.</p>
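<p>Read amplification is easy to express as a ratio. The sketch below assumes whole containers are fetched and all containers have the same size; the function name and index shape are hypothetical. Range reads would lower the byte count, but the number of containers touched, and therefore the number of requests, stays the same.</p>

```python
def read_amplification(needed, index, container_bytes: int) -> float:
    """Bytes fetched divided by bytes wanted when whole containers are read.

    needed: fingerprints in the version being restored.
    index:  fingerprint -> (container_id, offset, length).
    """
    wanted_bytes = 0
    containers = set()
    for fp in set(needed):
        container_id, _, length = index[fp]
        wanted_bytes += length
        containers.add(container_id)
    return len(containers) * container_bytes / wanted_bytes


# Version 1: four 1,000-byte chunks written sequentially into one container.
fresh = {f"c{i}": (0, i * 1000, 1000) for i in range(4)}
# A later version: the same amount of data, one chunk in each of four containers.
aged = {f"c{i}": (i, 0, 1000) for i in range(4)}

first = read_amplification(fresh, fresh, container_bytes=4000)
later = read_amplification(aged, aged, container_bytes=4000)
print(first, later)
```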

<p>Lillibridge et al. (FAST ’13) quantified this problem precisely.<span class="cdc-cite"><a href="#ref-30">[30]</a></span> They found that read speeds for the most recent version can degrade by orders of magnitude over a system’s lifetime as fragmentation accumulates. Each new version scatters its few new chunks across fresh containers, while the majority of its chunks (the deduplicated ones) point back to increasingly dispersed old containers.</p>

<p>The first mitigation came from Kaczmarczyk et al. (SYSTOR ’12), who proposed <strong>Context-Based Rewriting (CBR)</strong>.<span class="cdc-cite"><a href="#ref-29">[29]</a></span> The idea is to intervene at ingest time: as chunks are being written, identify those that would land in containers with poor locality (containers where most other chunks are not part of the current write stream), and selectively <em>rewrite</em> those duplicate chunks into new containers alongside their neighbors. The key insight is that rewriting a small fraction of duplicates can dramatically improve restore locality. CBR rewrote only about 5% of duplicate chunks and reduced the restore performance degradation from 12-55% to only 4-7%.<span class="cdc-cite"><a href="#ref-29">[29]</a></span> The cost is a modest reduction in deduplication ratio (since those rewritten chunks now exist in two places), but the restore improvement far outweighs the extra storage.</p>

<p>Lillibridge et al. (FAST ’13) proposed <strong>container capping</strong>, which directly limits how many distinct old containers a new version is allowed to reference.<span class="cdc-cite"><a href="#ref-30">[30]</a></span> The intuition is that restore performance degrades with the number of containers that must be read: if the current version’s chunks are spread across hundreds of containers, restoring it requires hundreds of reads, most of which fetch containers full of irrelevant data. Container capping sets a ceiling on that container count. When the incoming data stream would reference too many distinct containers, the system selectively rewrites some duplicate chunks into fresh containers alongside their neighbors, bringing the count under the cap. This is the same rewrite-to-improve-locality idea as CBR, but the trigger is different: CBR decides per-chunk based on the locality of the target container, while capping decides globally based on the total container count for the current version. Container capping sacrifices roughly 8% of the deduplication ratio in exchange for a 2-6x restore speed improvement.<span class="cdc-cite"><a href="#ref-30">[30]</a></span> The same paper also proposed <strong>forward assembly</strong>, a complementary technique that optimizes memory management during restores rather than the storage layout itself, achieving 2-4x restore improvement by using the file manifest to plan cache eviction decisions ahead of time.<span class="cdc-cite"><a href="#ref-30">[30]</a></span></p>
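<p>The capping decision can be sketched as a ranking problem. This is a simplified illustration of the idea, not the paper’s implementation: it assumes the stream is processed one segment at a time, keeps the old containers that contribute the most chunks to the segment, and rewrites duplicates that live anywhere else. The function name and data shapes are hypothetical.</p>

```python
from collections import Counter


def apply_cap(segment, index, cap: int):
    """Decide, for one segment of the incoming stream, which chunks to store.

    segment: fingerprints in stream order.
    index:   fingerprint -> container_id for chunks already stored.
    Returns (kept_containers, to_write). to_write holds new chunks plus
    duplicates rewritten because their container fell outside the cap.
    """
    usage = Counter(index[fp] for fp in set(segment) if fp in index)
    kept = {container_id for container_id, _ in usage.most_common(cap)}
    to_write = []
    for fp in dict.fromkeys(segment):          # drop repeats, keep stream order
        if fp not in index or index[fp] not in kept:
            to_write.append(fp)
    return kept, to_write


index = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2, "f": 2}
segment = ["a", "b", "c", "d", "e", "f", "new"]
kept, to_write = apply_cap(segment, index, cap=2)
print(kept, to_write)
```

<p>With a cap of two, containers 0 and 2 are kept because they supply the most chunks. Chunk <code>d</code> is stored a second time alongside the genuinely new chunk, which costs some deduplication but removes one container from every future restore of this segment.</p>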

<p>Fu et al. (ATC ’14) proposed <strong>History-Aware Rewriting (HAR)</strong>, which exploits version history to identify fragmented chunks more precisely.<span class="cdc-cite"><a href="#ref-31">[31]</a></span> Rather than applying a uniform rewriting threshold, HAR improves on CBR by analyzing past versions to determine which specific chunks are causing the worst read performance and rewrites only those. This targeted approach achieves similar restore improvements to CBR and container capping while rewriting fewer chunks, preserving more of the deduplication ratio.</p>

<p>The most recent major advance is <strong>MFDedup</strong> by Zou et al. (FAST ’21), which claimed to resolve the deduplication-versus-locality dilemma that had defined the previous decade of research.<span class="cdc-cite"><a href="#ref-32">[32]</a></span> Their key observation was that most duplicate chunks in a given version come from the immediately previous version, not from distant history. MFDedup exploits this with two techniques: <strong>NDF (Neighbor-Duplicate-Focus)</strong> indexing, which deduplicates primarily against the previous version rather than the entire history, and <strong>AVAR (Across-Version-Aware Reorganization)</strong>, which rearranges chunks after ingest into a compact sequential layout optimized for restore. The result: 1.1-2.2x higher deduplication ratio <em>and</em> 2.6-11.6x faster restore compared to previous approaches.<span class="cdc-cite"><a href="#ref-32">[32]</a></span> MFDedup achieves this by recognizing that the traditional approach of deduplicating against all prior versions creates unnecessarily scattered references, when deduplicating against only the most recent version captures most of the savings while preserving far better locality.</p>

<p>The progression from CBR (2012) through container capping (2013) to HAR (2014) to MFDedup (2021) shows a field converging on a clear principle: <strong>ingest-time layout decisions are the primary lever for controlling fragmentation</strong>. You cannot fix fragmentation after the fact without rewriting data. The question is how much to rewrite and when. Early approaches (CBR, capping) used simple heuristics applied uniformly. Later approaches (HAR) added historical context for more targeted rewriting. MFDedup went further by rethinking the deduplication target itself, shifting from “deduplicate against everything” to “deduplicate against the most relevant prior version.” Each step reduced the tradeoff between deduplication ratio and restore locality.</p>

<h3 id="garbage-collection">Garbage Collection</h3>

<p>Containers solve the operations cost problem but make deletion hard. In a non-deduplicated system, deleting a file or an old version frees its storage immediately. In a deduplicated system, deleting a file only removes <em>references</em> to chunks. A chunk can only be freed when no remaining file or version references it. And because chunks live inside containers, a container can only be freed when <em>every</em> chunk in it is unreferenced. A single surviving chunk, perhaps shared by one remaining version, keeps the entire container alive. As old versions are deleted over time, containers accumulate dead space: unreferenced chunks that waste storage but cannot be reclaimed because they share a container with at least one live chunk.</p>

<p><strong>Logical garbage collection</strong> is the traditional approach.<span class="cdc-cite"><a href="#ref-33">[33]</a></span> Walk the file-level metadata (version manifests) to determine which chunks are still referenced by at least one live file or version. Mark all unreferenced chunks. Then identify containers where all chunks are unreferenced and reclaim those containers entirely. Containers with a mix of live and dead chunks present a choice: leave them as-is (wasting the dead space) or “compact” them by reading out the live chunks, packing them into new containers, and freeing the old ones. Compaction reclaims dead space but costs I/O: you must read the old containers, write new ones, and update the chunk index. At scale, the metadata walk itself is expensive because it must traverse every manifest in the system.</p>
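<p>The mark-and-classify steps can be sketched directly. The function name and data shapes are hypothetical, and dead space is measured by chunk count rather than bytes to keep the example short.</p>

```python
def logical_gc(manifests, containers):
    """Mark live chunks from manifests, then classify containers.

    manifests:  {version_id: [fingerprint, ...]} for versions still retained.
    containers: {container_id: [fingerprint, ...]}.
    Returns (reclaimable, partially_dead), where partially_dead maps
    container_id -> fraction of its chunks that are dead.
    """
    live = set()
    for fingerprints in manifests.values():    # mark phase: walk every manifest
        live.update(fingerprints)

    reclaimable, partially_dead = [], {}
    for container_id, fingerprints in containers.items():
        dead = [fp for fp in fingerprints if fp not in live]
        if len(dead) == len(fingerprints):
            reclaimable.append(container_id)   # every chunk is unreferenced
        elif dead:
            partially_dead[container_id] = len(dead) / len(fingerprints)
    return reclaimable, partially_dead


manifests = {"v3": ["a", "b", "e"]}            # v1 and v2 have been deleted
containers = {0: ["a", "b"], 1: ["c", "d"], 2: ["e", "f", "g", "h"]}
reclaimable, partially_dead = logical_gc(manifests, containers)
print(reclaimable, partially_dead)
```

<p>Container 1 can be deleted outright. Container 2 is 75% dead but stays alive because of a single referenced chunk, which makes it a candidate for compaction.</p>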

<p>Douglis et al. (FAST ’17) introduced a fundamentally different approach: <strong>physical garbage collection</strong>.<span class="cdc-cite"><a href="#ref-33">[33]</a></span> Instead of walking metadata top-down (enumerating all references to determine which chunks are live), physical GC walks the raw container storage bottom-up with sequential I/O. For each container, it reads the metadata section and checks which chunks are still live by consulting a compact liveness structure. This inverts the access pattern: instead of many random lookups into metadata, the system does one sequential scan of container storage. Physical GC was up to two orders of magnitude faster than logical GC for extreme workloads (very high dedup ratios or millions of small files) and 10-60% faster in common cases.<span class="cdc-cite"><a href="#ref-33">[33]</a></span> The key insight is that when dedup ratios are high, the number of containers is much smaller than the number of metadata references, so scanning containers is cheaper than scanning references.</p>

<p>The connection between GC and fragmentation is deeper than it first appears. GC rewrites containers to reclaim dead chunks. Defragmentation rewrites containers to improve chunk locality for restores. Both involve reading, filtering, and rewriting containers. They are, in a sense, the same operation applied with different goals.</p>

<p>Liu et al. (EuroSys ’25) made this connection explicit with <strong>GCCDF (Garbage-Collection-Collaborative Defragmentation)</strong>, which unifies GC and defragmentation into a single pass.<span class="cdc-cite"><a href="#ref-34">[34]</a></span> The insight is that GC already reads partially-dead containers, extracts live chunks, and writes them into new containers. That rewriting step is exactly the opportunity to also fix fragmentation: instead of packing live chunks in arbitrary order, the system reorders them by restore locality, grouping chunks that will be read together into the same new container. One pass through the data, two benefits: reclaimed dead space and improved read performance. Running GC and defragmentation as separate operations would double the I/O cost, since each would independently read, filter, and rewrite containers. Piggybacking eliminates that redundancy.</p>
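<p>The piggybacking idea can be illustrated with a toy compaction step. This is not the GCCDF algorithm, only a sketch of the principle under a strong simplifying assumption: a single known restore order stands in for whatever locality model a real system uses. All names are hypothetical.</p>

```python
def compact_with_locality(victims, live, restore_order, chunks_per_container: int):
    """Copy live chunks out of victim containers, ordered by restore position.

    victims:       {container_id: [fingerprint, ...]} chosen for compaction.
    live:          set of fingerprints still referenced.
    restore_order: fingerprints in the order a restore would read them.
    Returns the new containers as lists of fingerprints.
    """
    position = {}
    for i, fp in enumerate(restore_order):
        position.setdefault(fp, i)             # first occurrence wins
    survivors = {fp for fps in victims.values() for fp in fps if fp in live}
    # Chunks absent from restore_order sort last, with ties broken by fingerprint.
    ordered = sorted(survivors, key=lambda fp: (position.get(fp, len(position)), fp))
    return [ordered[i:i + chunks_per_container]
            for i in range(0, len(ordered), chunks_per_container)]


victims = {0: ["a", "x", "c"], 1: ["b", "y", "d"]}   # x and y are dead
live = {"a", "b", "c", "d"}
new_containers = compact_with_locality(victims, live, ["a", "b", "c", "d"], 2)
print(new_containers)
```

<p>Before compaction, reading <code>a</code> then <code>b</code> touches both containers. Afterward, the dead chunks are gone and each consecutive pair sits in one container, so the same pass that reclaimed space also halved the containers touched per pair.</p>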

<p>On cloud object storage, GC and fragmentation take on additional economic significance. Dead chunks in containers waste storage cost (you pay per GB for bytes that no live file references). But the bigger lever is operations cost. Fragmentation scatters the live chunks needed to restore a file across more containers, which translates to more GET requests. GC and compaction consolidate live chunks into fewer, denser containers, improving locality and reducing the number of operations required to materialize a file. However, GC and compaction are not free, and can incur substantial operations costs to perform. There is a break-even point where the monthly savings from improved locality and reclaimed storage justify these operations costs. Finding that point requires modeling your application domain (read/write rates, change ratios, data growth) alongside the pricing scheme of your cloud provider. That analysis is out of scope for this series, but ideally the cost structure outlined here gives you enough context to start one for your application.</p>
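<p>As a starting point for that kind of analysis, the break-even can be framed as a payback period: the one-time request cost of compaction divided by the monthly savings it produces. The sketch below is deliberately crude. Every parameter name is hypothetical, prices are per single request or per GB-month, and compute, egress, and index-update costs are ignored.</p>

```python
def compaction_payback_months(containers_read: int, containers_written: int,
                              get_price: float, put_price: float,
                              gb_reclaimed: float, storage_price_per_gb_month: float,
                              gets_saved_per_month: float) -> float:
    """Months until a compaction pass pays for itself; inf if it never does."""
    one_time_cost = containers_read * get_price + containers_written * put_price
    monthly_saving = (gb_reclaimed * storage_price_per_gb_month
                      + gets_saved_per_month * get_price)
    if monthly_saving == 0:
        return float("inf")
    return one_time_cost / monthly_saving


# Illustrative inputs only: none of these figures come from a real price list.
months = compaction_payback_months(
    containers_read=1_000_000, containers_written=600_000,
    get_price=0.0000004, put_price=0.000005,
    gb_reclaimed=1600, storage_price_per_gb_month=0.023,
    gets_saved_per_month=0,
)
print(months)
```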

<h3 id="container-size-as-the-primary-lever">Container Size as the Primary Lever</h3>

<p><a href="/writings/content-defined-chunking-part-3">Part 3</a> identified chunk size as the single parameter that ties all costs together. On cloud object storage, container size supplants it. Container size determines the number of objects in the bucket, the number of API operations per write and read, and the degree of read amplification when materializing files. Fu et al. (FAST ’15) systematically explored how container size, chunk size, and rewriting aggressiveness interact for on-premises storage, finding that no single configuration dominates across all workloads.<span class="cdc-cite"><a href="#ref-35">[35]</a></span> Duggal et al. (ATC ’19) confirmed that the container abstraction translates to cloud storage, where operations have higher latency, GC must factor in the cost of compaction itself, and every GET has a price.<span class="cdc-cite"><a href="#ref-36">[36]</a></span></p>

<p>Containers dramatically reduce costs on major cloud providers, but the cost story does not end there. A newer generation of S3-compatible services has emerged with pricing models that eliminate or sharply reduce per-operation and egress fees, and caching can cut read costs further still. <a href="/writings/content-defined-chunking-part-5">Part 5</a> explores how these challenger providers and caching layers can drive costs down well beyond what containers alone achieve.</p>

<h3 id="references">References</h3>

<div class="cdc-references">

<div class="bib-entry" id="ref-15">
  <div class="bib-number">[15]</div>
  <div class="bib-citation">W. Xia, H. Jiang, D. Feng, F. Douglis, P. Shilane, Y. Hua, M. Fu, Y. Zhang &amp; Y. Zhou, "A Comprehensive Study of the Past, Present, and Future of Data Deduplication," <em>Proceedings of the IEEE</em>, vol. 104, no. 9, pp. 1681-1710, September 2016.</div>
  <div class="bib-links">
    <a href="https://ieeexplore.ieee.org/document/7529062" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE</a>
  </div>
</div>

<div class="bib-entry" id="ref-16">
  <div class="bib-number">[16]</div>
  <div class="bib-citation">B. Zhu, K. Li &amp; H. Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," <em>6th USENIX Conference on File and Storage Technologies (FAST '08)</em>, San Jose, CA, February 2008.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/legacy/event/fast08/tech/full_papers/zhu/zhu.pdf" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> PDF</a>
  </div>
</div>

<div class="bib-entry" id="ref-18">
  <div class="bib-number">[18]</div>
  <div class="bib-citation">M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise &amp; P. Camble, "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality," <em>7th USENIX Conference on File and Storage Technologies (FAST '09)</em>, San Jose, CA, February 2009.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/conference/fast-09/sparse-indexing-large-scale-inline-deduplication-using-sampling-and-locality" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

<div class="bib-entry" id="ref-29">
  <div class="bib-number">[29]</div>
  <div class="bib-citation">M. Kaczmarczyk, M. Barczynski, W. Kilian &amp; C. Dubnicki, "Reducing Impact of Data Fragmentation Caused by In-line Deduplication," <em>SYSTOR '12</em>, Haifa, Israel, June 2012.</div>
  <div class="bib-links">
    <a href="https://dl.acm.org/doi/10.1145/2367589.2367600" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> ACM</a>
  </div>
</div>

<div class="bib-entry" id="ref-30">
  <div class="bib-number">[30]</div>
  <div class="bib-citation">M. Lillibridge, K. Eshghi &amp; D. Bhagwat, "Improving Restore Speed for Backup Systems that Use Inline Chunk-Based Deduplication," <em>11th USENIX Conference on File and Storage Technologies (FAST '13)</em>, San Jose, CA, February 2013.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/conference/fast13/technical-sessions/presentation/lillibridge" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

<div class="bib-entry" id="ref-31">
  <div class="bib-number">[31]</div>
  <div class="bib-citation">M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, F. Huang &amp; Q. Liu, "Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information," <em>2014 USENIX Annual Technical Conference (USENIX ATC '14)</em>, Philadelphia, PA, June 2014.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/conference/atc14/technical-sessions/presentation/fu_min" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

<div class="bib-entry" id="ref-32">
  <div class="bib-number">[32]</div>
  <div class="bib-citation">X. Zou et al., "The Dilemma between Deduplication and Locality: Can Both be Achieved?" <em>19th USENIX Conference on File and Storage Technologies (FAST '21)</em>, February 2021.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/conference/fast21/presentation/zou" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

<div class="bib-entry" id="ref-33">
  <div class="bib-number">[33]</div>
  <div class="bib-citation">F. Douglis, A. Duggal, P. Shilane, T. Wong, S. Yan &amp; F. Botelho, "The Logic of Physical Garbage Collection in Deduplicating Storage," <em>15th USENIX Conference on File and Storage Technologies (FAST '17)</em>, Santa Clara, CA, February 2017.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/conference/fast17/technical-sessions/presentation/douglis" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

<div class="bib-entry" id="ref-34">
  <div class="bib-number">[34]</div>
  <div class="bib-citation">D. Liu et al., "Garbage Collection Does Not Only Collect Garbage: Piggybacking-Style Defragmentation for Deduplicated Backup Storage," <em>EuroSys '25</em>, 2025.</div>
  <div class="bib-links">
    <a href="https://dl.acm.org/doi/10.1145/3689031.3717493" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> ACM</a>
  </div>
</div>

<div class="bib-entry" id="ref-35">
  <div class="bib-number">[35]</div>
  <div class="bib-citation">M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang &amp; Y. Tan, "Design Tradeoffs for Data Deduplication Performance in Backup Workloads," <em>13th USENIX Conference on File and Storage Technologies (FAST '15)</em>, Santa Clara, CA, February 2015.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/conference/fast15/technical-sessions/presentation/fu" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

<div class="bib-entry" id="ref-36">
  <div class="bib-number">[36]</div>
  <div class="bib-citation">A. Duggal, F. Jenkins, P. Shilane, R. Chinthekindi, R. Shah &amp; M. Kamat, "Data Domain Cloud Tier: Backup here, backup there, deduplicated everywhere!" <em>2019 USENIX Annual Technical Conference (USENIX ATC '19)</em>, Renton, WA, July 2019.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/conference/atc19/presentation/duggal" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

</div>

<div class="cdc-series-nav">
&larr; <a href="/writings/content-defined-chunking-part-3">Part 3: Deduplication in Action</a> &middot; Continue reading &rarr; <a href="/writings/content-defined-chunking-part-5">Part 5: CDC at Scale on a Budget</a>
</div>

<script type="module" src="/assets/js/cdc-animations.js"></script>

]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Deduplication in Action</title>
      <link>https://rickwinfrey.com/writings/content-defined-chunking-part-3</link>
      <guid isPermaLink="true">https://rickwinfrey.com/writings/content-defined-chunking-part-3</guid>
      <pubDate>Mon, 16 Feb 2026 12:00:00 +0000</pubDate>
      
      <description>See CDC-based deduplication in action, learn where CDC is deployed today, and explore the frontier of structure-aware chunking for source code.</description>
      
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <![CDATA[<style>
/* ==========================================================================
   CDC Animation Styles
   ========================================================================== */

/* View mode tabs (Text / Blocks / Hex) */
.cdc-view-tabs {
  display: flex;
  gap: 0.25rem;
  background: rgba(61, 58, 54, 0.05);
  padding: 0.25rem;
  border-radius: 6px;
}

.cdc-view-tab {
  padding: 0.4rem 0.75rem;
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 0.8rem;
  color: #8b7355;
  background: transparent;
  border: none;
  border-radius: 4px;
  cursor: pointer;
  transition: all 0.15s ease;
}

.cdc-view-tab:hover {
  color: #3d3a36;
}

.cdc-view-tab.active {
  background: #fff;
  color: #c45a3b;
  box-shadow: 0 1px 3px rgba(0,0,0,0.1);
}

/* Content display area */
.cdc-content {
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 1rem;
  line-height: 1.8;
  color: #3d3a36;
}

/* Text view with chunk highlighting */
.cdc-text-view {
  white-space: pre-wrap;
  word-break: break-word;
}

.cdc-text-view .chunk {
  display: inline;
  padding: 0.1rem 0;
  border-radius: 2px;
  transition: background-color 0.2s ease;
}

/* Chunk colors - warm palette matching site */
.cdc-text-view .chunk-0 { background-color: rgba(196, 90, 59, 0.15); }
.cdc-text-view .chunk-1 { background-color: rgba(212, 165, 116, 0.25); }
.cdc-text-view .chunk-2 { background-color: rgba(139, 115, 85, 0.15); }
.cdc-text-view .chunk-3 { background-color: rgba(196, 90, 59, 0.25); }
.cdc-text-view .chunk-4 { background-color: rgba(212, 165, 116, 0.15); }
.cdc-text-view .chunk-5 { background-color: rgba(139, 115, 85, 0.25); }

/* Block view */
.cdc-blocks-view {
  display: flex;
  align-items: stretch;
  gap: 2px;
  margin-top: 1rem;
  padding: 0.5rem 0;
  width: 100%;
}

.cdc-block {
  height: 24px;
  border-radius: 3px;
  transition: all 0.2s ease;
  position: relative;
}

.cdc-block.chunk-0 { background-color: #c45a3b; }
.cdc-block.chunk-1 { background-color: #d4a574; }
.cdc-block.chunk-2 { background-color: #8b7355; }
.cdc-block.chunk-3 { background-color: #c45a3b; opacity: 0.7; }
.cdc-block.chunk-4 { background-color: #d4a574; opacity: 0.7; }
.cdc-block.chunk-5 { background-color: #8b7355; opacity: 0.7; }

.cdc-block:hover {
  transform: scaleY(1.2);
  z-index: 1;
}

/* Hex view */
.cdc-hex-view {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  line-height: 1.6;
  display: flex;
  flex-wrap: wrap;
  gap: 0.5rem 1rem;
}

.cdc-hex-byte {
  padding: 0.15rem 0.3rem;
  border-radius: 2px;
}

/* File icon visualization for fixed vs CDC comparison */
.cdc-file-icon {
  position: relative;
  background: #fff;
  border: 1px solid #d0d0d0;
  border-radius: 3px;
  padding: 1.5rem;
  padding-top: 2.25rem;
  margin: 0.75rem 0;
  box-shadow: 0 2px 8px rgba(0, 0, 0, 0.08);
}

/* Folded corner effect */
.cdc-file-corner {
  position: absolute;
  top: 0;
  right: 0;
  width: 0;
  height: 0;
  border-style: solid;
  border-width: 0 24px 24px 0;
  border-color: transparent #faf9f7 transparent transparent;
  filter: drop-shadow(-1px 1px 1px rgba(0, 0, 0, 0.1));
}

.cdc-file-corner::before {
  content: '';
  position: absolute;
  top: 0;
  right: -24px;
  width: 0;
  height: 0;
  border-style: solid;
  border-width: 0 0 24px 24px;
  border-color: transparent transparent #e8e8e8 transparent;
}

.cdc-file-label {
  position: absolute;
  top: 0.6rem;
  left: 1rem;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  font-weight: 600;
  color: #8b7355;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

.cdc-file-content {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  line-height: 2.2;
  color: #3d3a36;
  white-space: pre-wrap;
  word-break: break-word;
}

.cdc-chunk-explanation {
  font-size: 0.8rem;
  color: #8b7355;
  margin: 0.25rem 0 0.5rem 0;
  font-style: italic;
}

/* Chunk spans with box styling - matches CHUNK_SOLID_COLORS from cdc-animations.js */
.cdc-chunk {
  padding: 0.2rem 0.35rem;
  border-radius: 3px;
  border: 2px solid;
  display: inline;
  -webkit-box-decoration-break: clone;
  box-decoration-break: clone;
}

.cdc-chunk.chunk-a {
  background: rgba(196, 90, 59, 0.15);
  border-color: #c45a3b;
}

.cdc-chunk.chunk-b {
  background: rgba(90, 138, 90, 0.15);
  border-color: #5a8a5a;
}

.cdc-chunk.chunk-c {
  background: rgba(70, 110, 160, 0.15);
  border-color: #466ea0;
}

.cdc-chunk.chunk-d {
  background: rgba(160, 100, 50, 0.15);
  border-color: #a06432;
}

.cdc-chunk.chunk-e {
  background: rgba(130, 80, 150, 0.15);
  border-color: #825096;
}

/* New chunk - terracotta accent to match interactive demos */
.cdc-chunk.chunk-new {
  background: rgba(196, 90, 59, 0.2);
  border-color: #c45a3b;
  border-style: solid;
}

/* Unchanged chunk - muted gray, matches shared/dedup style in animations */
.cdc-chunk.unchanged {
  background: rgba(61, 58, 54, 0.06);
  border-color: rgba(61, 58, 54, 0.2);
  color: #8b8178;
}

/* Changed chunk - dashed border to signal the chunk content shifted */
.cdc-chunk.changed {
  border-style: dashed;
}

/* Chunk Comparison Demo (JS-powered before/after) */
.cdc-chunk-comparison-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  font-weight: 600;
  color: #8b7355;
  text-transform: uppercase;
  letter-spacing: 0.05em;
  margin-bottom: 0.4rem;
}

.cdc-chunk-comparison-file {
  margin-bottom: 0.75rem;
}

.cdc-chunk-comparison-text {
  white-space: pre-wrap;
  word-break: break-word;
  padding: 0.75rem;
  background: rgba(61, 58, 54, 0.02);
  border-radius: 6px;
  border: 1px solid rgba(61, 58, 54, 0.06);
  margin-bottom: 0.5rem;
  font-size: 0.85rem;
  line-height: 1.6;
}

.cdc-cmp-chunk {
  padding: 0.15rem 0.25rem;
  border-radius: 3px;
  border: 2px solid;
  display: inline-block;
  cursor: default;
  transition: filter 0.1s ease;
}

.cdc-cmp-chunk.unchanged {
  background: rgba(61, 58, 54, 0.06);
  border-color: rgba(61, 58, 54, 0.2);
  color: #8b8178;
}

.cdc-cmp-chunk.new {
  border-style: solid;
}

.cdc-cmp-chunk.chunk-hover {
  filter: brightness(0.82);
  outline: 3px solid rgba(61, 58, 54, 0.5);
  outline-offset: 0px;
  box-shadow: 0 0 8px rgba(0, 0, 0, 0.15);
}

.cdc-cmp-chunk.unchanged.chunk-hover {
  filter: brightness(0.85);
  outline: 3px solid rgba(61, 58, 54, 0.4);
  background: rgba(61, 58, 54, 0.15);
}

/* Chunk wrapper: label above, text below */
.cdc-cmp-chunk-wrapper {
  display: inline-flex;
  flex-direction: column;
  align-items: center;
  vertical-align: top;
  margin: 0.15rem 0.2rem;
}

.cdc-chunk-summary {
  display: grid;
  grid-template-columns: repeat(4, 1fr);
  gap: 0.75rem;
  padding: 0.75rem;
  margin: 0.5rem 0;
  background: rgba(61, 58, 54, 0.03);
  border-radius: 6px;
  border: 1px solid rgba(61, 58, 54, 0.06);
}

.cdc-chunk-summary-stat {
  text-align: center;
}

.cdc-chunk-summary-value {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 1.1rem;
  font-weight: 600;
  line-height: 1.2;
}

.cdc-chunk-summary-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.65rem;
  color: #8b7355;
  margin-top: 0.2rem;
  text-transform: uppercase;
  letter-spacing: 0.04em;
}

.cdc-cmp-chunk-label {
  display: block;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  font-weight: 600;
  text-align: center;
  letter-spacing: 0.02em;
  margin-bottom: 0.15rem;
}

/* Edit indicator arrow */
.cdc-edit-indicator {
  text-align: center;
  font-size: 0.8rem;
  color: #8b7355;
  padding: 0.5rem 0;
}

/* Deduplication result */
.cdc-dedup-result {
  text-align: center;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  font-weight: 600;
  padding: 0.75rem;
  border-radius: 6px;
  margin-top: 0.75rem;
}

.cdc-dedup-result.bad {
  background: rgba(196, 90, 59, 0.1);
  color: #a84832;
}

.cdc-dedup-result.good {
  background: rgba(90, 160, 90, 0.1);
  color: #3d8b3d;
}

/* Rolling window indicator */
.cdc-window {
  position: absolute;
  height: 100%;
  background: rgba(196, 90, 59, 0.3);
  border: 2px solid #c45a3b;
  border-radius: 4px;
  pointer-events: none;
  transition: left 0.1s ease;
}

/* Hash display */
.cdc-hash-display {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.8rem;
  color: #8b7355;
  min-height: 1.4em;
}

.cdc-hash-display strong {
  color: #c45a3b;
}

/* Controls panel */
.cdc-controls {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(140px, 1fr));
  gap: 1.25rem;
  padding: 1.25rem;
  background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-top: none;
  border-radius: 0 0 8px 8px;
}

.cdc-control-group {
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
}

.cdc-control-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  color: #3d3a36;
}

.cdc-controls input[type="range"] {
  width: 100%;
  height: 6px;
  -webkit-appearance: none;
  appearance: none;
  background: linear-gradient(to right, #d4a574, #c45a3b);
  border-radius: 3px;
  outline: none;
}

.cdc-controls input[type="range"]::-webkit-slider-thumb {
  -webkit-appearance: none;
  appearance: none;
  width: 16px;
  height: 16px;
  background: #c45a3b;
  border-radius: 50%;
  cursor: pointer;
  box-shadow: 0 2px 4px rgba(0,0,0,0.2);
  transition: transform 0.15s ease;
}

.cdc-controls input[type="range"]::-webkit-slider-thumb:hover {
  transform: scale(1.1);
}

.cdc-controls input[type="range"]::-moz-range-thumb {
  width: 16px;
  height: 16px;
  background: #c45a3b;
  border-radius: 50%;
  cursor: pointer;
  border: none;
  box-shadow: 0 2px 4px rgba(0,0,0,0.2);
}

/* Playback controls */
.cdc-playback {
  display: flex;
  align-items: center;
  gap: 0.75rem;
  padding: 1rem 1.25rem;
  background: rgba(61, 58, 54, 0.02);
  border-top: 1px solid rgba(61, 58, 54, 0.08);
}

.cdc-playback-btn {
  width: 36px;
  height: 36px;
  border-radius: 50%;
  border: none;
  background: #c45a3b;
  color: #fff;
  cursor: pointer;
  display: flex;
  align-items: center;
  justify-content: center;
  transition: all 0.15s ease;
}

.cdc-playback-btn:hover {
  background: #a84832;
  transform: scale(1.05);
}

.cdc-playback-btn.secondary {
  background: rgba(61, 58, 54, 0.1);
  color: #3d3a36;
}

.cdc-playback-btn.secondary:hover {
  background: rgba(61, 58, 54, 0.2);
}

.cdc-speed-control {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  margin-left: auto;
}

.cdc-speed-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.8rem;
  color: #8b7355;
}

/* Progress indicator */
.cdc-progress {
  flex: 1;
  height: 4px;
  background: rgba(61, 58, 54, 0.1);
  border-radius: 2px;
  overflow: hidden;
  margin: 0 0.5rem;
}

.cdc-progress-bar {
  height: 100%;
  background: linear-gradient(to right, #d4a574, #c45a3b);
  border-radius: 2px;
  transition: width 0.1s ease;
}

/* Side-by-side comparison */
.cdc-comparison {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1.5rem;
  margin: 2rem 0;
}

@media (max-width: 50em) {
  .cdc-comparison {
    grid-template-columns: 1fr;
  }
}

.cdc-comparison-panel {
  padding: 1.25rem;
  background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-radius: 8px;
}

.cdc-comparison-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 1rem;
  font-weight: 600;
  color: #3d3a36;
  margin-bottom: 1rem;
  padding-bottom: 0.75rem;
  border-bottom: 1px solid rgba(61, 58, 54, 0.1);
}

/* Chunk boundary marker */
.cdc-boundary-marker {
  display: inline-block;
  width: 2px;
  height: 1.2em;
  background: #c45a3b;
  margin: 0 1px;
  vertical-align: middle;
  border-radius: 1px;
}

/* Stats display */
.cdc-stats {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(120px, 1fr));
  gap: 1rem;
  padding: 1rem;
  background: rgba(61, 58, 54, 0.02);
  border-radius: 6px;
  margin-top: 1rem;
}

.cdc-stat {
  text-align: center;
}

.cdc-stat-value {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 1.5rem;
  font-weight: 600;
  color: #c45a3b;
}

.cdc-stat-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.75rem;
  color: #8b7355;
  margin-top: 0.25rem;
}

/* Deduplication visualization */
.cdc-dedup-viz {
  margin: 2rem 0;
}

.cdc-dedup-files {
  display: grid;
  grid-template-columns: 1fr auto 1fr;
  gap: 1rem;
  align-items: start;
}

.cdc-dedup-arrow {
  display: flex;
  align-items: center;
  justify-content: center;
  padding: 2rem 0;
  color: #8b7355;
  font-size: 1.5rem;
}

.cdc-dedup-storage {
  margin-top: 1.5rem;
  padding: 1.25rem;
  background: linear-gradient(135deg, rgba(196, 90, 59, 0.05) 0%, rgba(212, 165, 116, 0.08) 100%);
  border-radius: 8px;
}

.cdc-dedup-storage-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.9rem;
  font-weight: 600;
  color: #3d3a36;
  margin-bottom: 0.75rem;
}

.cdc-dedup-chunks {
  display: flex;
  flex-wrap: wrap;
  gap: 0.75rem;
  overflow: visible;
}

.cdc-dedup-chunk {
  padding: 0.4rem 0.75rem;
  border-radius: 4px;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  color: #fff;
}

.cdc-dedup-chunk.shared {
  box-shadow: 0 0 0 2px #fff, 0 0 0 5px #3d3a36;
}

.cdc-dedup-chunk-badge {
  position: absolute;
  top: -6px;
  right: -6px;
  background: #3d3a36;
  color: #fff;
  font-size: 0.55rem;
  font-weight: 700;
  line-height: 1;
  padding: 2px 4px;
  border-radius: 8px;
  pointer-events: none;
}

/* Versioned Dedup - Editor */
.cdc-dedup-editor { display: flex; flex-direction: column; gap: 0.75rem; margin-bottom: 1.5rem; }

.cdc-dedup-textarea {
  width: 100%; min-height: 120px; padding: 0.75rem;
  font-family: 'Source Serif 4', Georgia, serif; font-size: 0.9rem; line-height: 1.6;
  color: #3d3a36; background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.2); border-radius: 6px;
  resize: vertical; box-sizing: border-box;
}
.cdc-dedup-textarea:focus { outline: none; border-color: #c45a3b; box-shadow: 0 0 0 2px rgba(196, 90, 59, 0.15); }

.cdc-dedup-save-btn {
  align-self: flex-start; padding: 0.5rem 1.25rem;
  font-family: 'Libre Baskerville', Georgia, serif; font-size: 0.85rem;
  color: #fff; background: #c45a3b; border: none; border-radius: 6px;
  cursor: pointer; transition: background 0.15s ease, transform 0.1s ease;
}
.cdc-dedup-save-btn:hover { background: #a84832; transform: translateY(-1px); }
.cdc-dedup-save-btn:active { transform: translateY(0); }

/* Versioned Dedup - Timeline */
.cdc-dedup-timeline { position: relative; margin-bottom: 1.5rem; }

.cdc-version-entry { display: flex; gap: 1rem; padding-bottom: 1.5rem; position: relative; }

.cdc-version-entry:not(:last-child)::before {
  content: ''; position: absolute; top: 15px; left: 5px;
  width: 2px; bottom: 0; background: rgba(61, 58, 54, 0.15);
}

.cdc-version-dot {
  position: relative; flex-shrink: 0;
  width: 12px; height: 12px; margin-top: 3px;
  background: #c45a3b; border-radius: 50%;
  border: 2px solid #fff; box-shadow: 0 0 0 1px rgba(61, 58, 54, 0.2);
}

.cdc-version-content { flex: 1; min-width: 0; }

.cdc-version-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.9rem; font-weight: 600; color: #3d3a36; margin-bottom: 0.5rem;
}

.cdc-version-cols { display: grid; grid-template-columns: 1fr 180px; gap: 1rem; align-items: start; }

.cdc-version-text {
  white-space: pre-wrap; word-break: break-word;
  padding: 0.5rem; background: rgba(61, 58, 54, 0.02);
  border-radius: 6px; border: 1px solid rgba(61, 58, 54, 0.06);
}

.cdc-version-blocks { display: flex; flex-direction: column; gap: 0.5rem; }
.cdc-version-blocks .cdc-blocks-view { margin-top: 0; min-height: 24px; }

.cdc-version-stats {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.7rem; color: #8b7355; line-height: 1.4;
}

.cdc-dedup-timeline-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem; font-weight: 600; color: #8b7355;
  text-transform: uppercase; letter-spacing: 0.06em;
  margin-bottom: 0.75rem;
}

[data-chunk-hash].hash-hover {
  filter: brightness(0.85);
  outline: 2px solid rgba(61, 58, 54, 0.4);
  outline-offset: -1px;
}

@media (max-width: 42em) {
  .cdc-version-cols { grid-template-columns: 1fr; }
}

/* Table of Contents */
.cdc-toc {
  margin: 2rem 0;
  padding: 1.25rem 1.5rem;
  background: rgba(61, 58, 54, 0.03);
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-radius: 8px;
}

.cdc-toc strong {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.95rem;
  color: #3d3a36;
}

.cdc-toc ol {
  margin: 0.75rem 0 0 0;
  padding-left: 1.25rem;
}

.cdc-toc li {
  margin-bottom: 0.4rem;
  font-size: 0.9rem;
  line-height: 1.5;
  color: #5a564f;
}

.cdc-toc a {
  color: #c45a3b;
  text-decoration: none;
  font-weight: 600;
}

.cdc-toc a:hover {
  text-decoration: underline;
}

.cdc-toc ul {
  margin: 0.25rem 0 0.25rem 0;
  padding-left: 1.25rem;
  list-style: none;
}

.cdc-toc ul li {
  margin-bottom: 0.15rem;
  font-size: 0.82rem;
  color: #8b7355;
}

.cdc-toc ul li a {
  font-weight: 400;
  color: #8b7355;
}

.cdc-toc ul li a:hover {
  color: #c45a3b;
}

/* Taxonomy tree diagram */
.cdc-taxonomy {
  padding: 1.25rem 1.5rem;
  background: rgba(61, 58, 54, 0.03);
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-radius: 8px;
}

.cdc-taxonomy strong {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.95rem;
  color: #3d3a36;
}

.cdc-taxonomy-tree {
  margin-top: 0.75rem;
  display: flex;
  flex-direction: column;
  align-items: center;
  gap: 0;
}

/* Root node */
.cdc-tax-root {
  padding: 0.4rem 1rem;
  background: #3d3a36;
  color: #fff;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.8rem;
  font-weight: 600;
  border-radius: 6px;
  text-align: center;
}

/* Vertical connector from root */
.cdc-tax-vline {
  width: 2px;
  height: 16px;
  background: rgba(61, 58, 54, 0.25);
}

/* Horizontal bar connecting the three families */
.cdc-tax-hbar {
  width: 80%;
  height: 2px;
  background: rgba(61, 58, 54, 0.25);
  position: relative;
}

/* Three-column family layout */
.cdc-tax-families {
  display: grid;
  grid-template-columns: 1fr 1fr 1fr;
  gap: 0.5rem;
  width: 100%;
  margin-top: 0;
}

.cdc-tax-family {
  display: flex;
  flex-direction: column;
  align-items: center;
  gap: 0;
}

/* Vertical connector from hbar to family label */
.cdc-tax-family .cdc-tax-vline {
  height: 12px;
}

.cdc-tax-family-label {
  padding: 0.3rem 0.5rem;
  border-radius: 5px;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  font-weight: 600;
  text-align: center;
  white-space: nowrap;
}

.cdc-tax-family-label.bsw {
  background: rgba(196, 90, 59, 0.12);
  color: #c45a3b;
  border: 1px solid rgba(196, 90, 59, 0.25);
}

.cdc-tax-family-label.extrema {
  background: rgba(42, 125, 79, 0.1);
  color: #2a7d4f;
  border: 1px solid rgba(42, 125, 79, 0.2);
}

.cdc-tax-family-label.statistical {
  background: rgba(139, 115, 85, 0.12);
  color: #8b7355;
  border: 1px solid rgba(139, 115, 85, 0.25);
}

.cdc-tax-algorithms {
  margin-top: 0.35rem;
  display: flex;
  flex-direction: column;
  align-items: center;
  gap: 0.2rem;
}

.cdc-tax-algo {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.65rem;
  color: #5a564f;
  line-height: 1.3;
}

.cdc-tax-algo .cdc-tax-year {
  color: #a89b8c;
}

.cdc-learn-more {
  display: inline-flex;
  align-items: center;
  gap: 0.4rem;
  margin-top: 0.75rem;
  padding: 0.4rem 0.75rem;
  background: rgba(212, 165, 116, 0.15);
  border-radius: 4px;
  font-size: 0.8rem;
  font-style: normal;
  color: #8b7355;
}

.cdc-learn-more::before {
  content: "💡";
}

/* Combined text + hex view */
.cdc-combined-view {
  display: flex;
  flex-wrap: wrap;
  gap: 1px;
}

.cdc-byte-col {
  display: flex;
  flex-direction: column;
  align-items: center;
  border-radius: 2px;
  padding: 0.15rem 0.1rem;
}

.cdc-byte-char {
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 0.95rem;
  line-height: 1.4;
  color: #3d3a36;
}

.cdc-byte-hex {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  line-height: 1;
  color: #8b7355;
  margin-top: 1px;
}

/* Block annotation bar (below text/hex views) */
.cdc-block-wrapper {
  display: flex;
  flex-direction: column;
  align-items: center;
}

.cdc-block {
  width: 100%;
}

.cdc-block-annotation {
  width: 100%;
  position: relative;
  margin-top: 0.3rem;
}

.cdc-block-line {
  width: 100%;
  height: 0;
  border-top: 1.5px solid #8b7355;
  opacity: 0.5;
}

.cdc-block-tick {
  position: absolute;
  left: 50%;
  top: 0;
  transform: translateX(-50%);
  width: 1.5px;
  height: 8px;
  background: #8b7355;
  opacity: 0.5;
}

.cdc-block-label {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #8b7355;
  white-space: nowrap;
  user-select: none;
  line-height: 1;
  text-align: center;
  margin-top: 10px;
  overflow: hidden;
  text-overflow: ellipsis;
}

/* Chunk hover highlights */
.cdc-combined-view .cdc-byte-col.chunk-hover {
  filter: brightness(0.85);
  outline: 1px solid rgba(61, 58, 54, 0.25);
  outline-offset: -1px;
}

.cdc-block-wrapper.chunk-hover .cdc-block {
  filter: brightness(1.15);
  box-shadow: 0 0 6px rgba(0, 0, 0, 0.2);
}

.cdc-block-wrapper.chunk-hover .cdc-block-label {
  color: #3d3a36;
  font-weight: 600;
}

.cdc-text-view .chunk.chunk-hover {
  filter: brightness(0.9);
  outline: 1px solid rgba(61, 58, 54, 0.3);
  outline-offset: -1px;
}

/* Gear Lookup Table grid */
.gear-table-grid {
  display: grid;
  grid-template-columns: repeat(16, 1fr);
  gap: 1px;
  margin-top: 0.5rem;
}

.gear-table-cell {
  height: 15px;
  border-radius: 1px;
  cursor: pointer;
  transition: transform 0.1s ease, box-shadow 0.15s ease;
  position: relative;
}

.gear-table-cell:hover {
  transform: scale(1.4);
  z-index: 2;
  box-shadow: 0 0 4px rgba(0,0,0,0.2);
}

.gear-table-cell.active {
  outline: 2px solid #c45a3b;
  outline-offset: 0px;
  box-shadow: 0 0 6px rgba(196, 90, 59, 0.5);
  z-index: 3;
  transform: scale(1.4);
}

.gear-table-readout {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.8rem;
  color: #8b7355;
  min-height: 1.4em;
}

.gear-table-readout strong {
  color: #c45a3b;
}

/* Rolling hash window strip */
.gear-hash-window {
  display: flex;
  gap: 1px;
  overflow-x: auto;
  padding: 0.25rem 0;
  margin-bottom: 0.5rem;
  min-height: 3.2rem;
}

.gear-hw-cell {
  display: flex;
  flex-direction: column;
  align-items: center;
  min-width: 2rem;
  padding: 0.2rem 0.15rem;
  border-radius: 3px;
  background: rgba(61, 58, 54, 0.04);
  transition: background-color 0.15s ease;
}

.gear-hw-cell.current {
  outline: 2px solid #c45a3b;
  outline-offset: -1px;
}

.gear-hw-cell.boundary {
  border-right: 2px solid #2a7d4f;
}

.gear-hw-char {
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 0.8rem;
  color: #3d3a36;
  line-height: 1.2;
}

.gear-hw-hash {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.5rem;
  color: #8b7355;
  line-height: 1;
  margin-top: 2px;
}

.gear-hw-hash.boundary {
  color: #2a7d4f;
  font-weight: 700;
}

/* Bit-shift visualization */
.gear-shift-viz {
  margin-bottom: 0.5rem;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
}

.gear-shift-row {
  display: flex;
  align-items: center;
  gap: 0.3rem;
  margin-bottom: 3px;
}

.gear-shift-label {
  width: 4rem;
  text-align: right;
  color: #8b7355;
  font-size: 0.6rem;
  flex-shrink: 0;
}

.gear-shift-hex {
  width: 5.5rem;
  text-align: right;
  color: #3d3a36;
  font-size: 0.6rem;
  flex-shrink: 0;
  padding-right: 0.3rem;
}

.gear-shift-bits {
  display: flex;
  gap: 0;
  position: relative;
}

.gear-bit {
  width: 7px;
  height: 14px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.45rem;
  line-height: 1;
  border-radius: 1px;
}

.gear-bit.b0,
.gear-bit.b1 {
  background: rgba(61, 58, 54, 0.06);
  color: #3d3a36;
}

.gear-bit.dropped {
  background: rgba(196, 90, 59, 0.25);
  color: #c45a3b;
  text-decoration: line-through;
}

.gear-bit.entering {
  background: rgba(90, 138, 90, 0.25);
  color: #5a8a5a;
  font-weight: 700;
}

@keyframes gear-slide-left {
  0% { transform: translateX(7px); opacity: 0.5; }
  100% { transform: translateX(0); opacity: 1; }
}

.gear-shift-bits.animated .gear-bit {
  animation: gear-slide-left 0.25s ease-out;
}

.gear-shift-box {
  border: 1.5px solid rgba(196, 90, 59, 0.3);
  border-radius: 6px;
  padding: 0.4rem 0.5rem;
  background: rgba(196, 90, 59, 0.02);
}

.gear-shift-connector {
  text-align: center;
  color: #8b7355;
  font-size: 0.7rem;
  line-height: 1;
  padding: 0.15rem 0;
}

.gear-shift-add {
  border: 1.5px solid rgba(61, 58, 54, 0.12);
  border-radius: 6px;
  padding: 0.4rem 0.5rem;
  background: rgba(61, 58, 54, 0.02);
}

.gear-shift-separator {
  width: calc(32 * 7px);
  border-top: 1px solid rgba(61, 58, 54, 0.15);
  margin: 2px 0;
}

/* Two-column layout: Operation panel + Gear table */
.gear-two-col {
  display: flex;
  gap: 1.5rem;
  margin-top: 1rem;
  align-items: flex-start;
}

.gear-col-left {
  flex: 1 1 0;
  min-width: 0;
}

.gear-col-right {
  flex: 1 1 0;
  min-width: 0;
}

/* Chunk boundary marker (vertical separator) */
.chunk-boundary-marker {
  display: inline-block;
  width: 2px;
  height: 1.2em;
  background: #c45a3b;
  margin: 0 2px;
  vertical-align: middle;
  border-radius: 1px;
  opacity: 0.6;
}

.chunk-label {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #8b7355;
  background: rgba(61, 58, 54, 0.06);
  padding: 0.1rem 0.3rem;
  border-radius: 2px;
  margin-right: 2px;
  vertical-align: top;
  line-height: 1;
  user-select: none;
}

/* Parametric Chunking Explorer - distribution chart */
.parametric-distribution-chart {
  position: relative;
  display: flex;
  align-items: flex-end;
  gap: 2px;
  height: 120px;
  padding: 0.5rem 0;
  margin-bottom: 1rem;
}

.parametric-dist-bar {
  flex: 1;
  min-width: 3px;
  border-radius: 2px 2px 0 0;
  transition: height 0.2s ease;
  cursor: pointer;
  position: relative;
}

.parametric-dist-bar:hover { opacity: 0.8; }

.parametric-dist-tooltip {
  display: none;
  position: absolute;
  bottom: 100%;
  left: 50%;
  transform: translateX(-50%);
  background: #3d3a36;
  color: #fff;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.7rem;
  padding: 0.25rem 0.5rem;
  border-radius: 4px;
  white-space: nowrap;
  pointer-events: none;
  margin-bottom: 4px;
  z-index: 10;
}

.parametric-dist-bar:hover .parametric-dist-tooltip { display: block; }

.parametric-dist-reference {
  position: absolute;
  left: 0;
  right: 0;
  border-top: 2px dashed rgba(196, 90, 59, 0.5);
  pointer-events: none;
  z-index: 1;
}

.parametric-dist-reference-label {
  position: absolute;
  right: 0;
  top: -1.1rem;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #c45a3b;
  white-space: nowrap;
}

.parametric-dist-bar.chunk-hover {
  filter: brightness(1.15);
  box-shadow: 0 0 6px rgba(0, 0, 0, 0.2);
}

.parametric-derived-params {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  color: #8b7355;
  white-space: nowrap;
}

/* Mobile responsive */
@media (max-width: 42em) {
  .cdc-controls {
    grid-template-columns: 1fr;
  }

  .cdc-dedup-files {
    grid-template-columns: 1fr;
  }

  .cdc-dedup-arrow {
    transform: rotate(90deg);
    padding: 1rem 0;
  }

  .cdc-chunk-summary {
    grid-template-columns: repeat(2, 1fr);
  }

  .cdc-hex-view {
    font-size: 0.65rem;
  }

  .gear-two-col {
    flex-direction: column;
  }

  .gear-table-grid {
    gap: 0px;
  }

  .gear-table-cell {
    border-radius: 0;
    height: 12px;
  }
}

/* ==========================================================================
   Pipeline Diagram
   ========================================================================== */
.cdc-pipe {
  display: flex;
  flex-direction: column;
  align-items: center;
  gap: 0;
  padding: 1rem 0;
}

.cdc-pipe-stage {
  display: grid;
  grid-template-columns: 10rem minmax(0, 1fr) 16rem;
  align-items: center;
  gap: 1rem;
  width: 100%;
  padding: 1.1rem 0;
}

.cdc-pipe-label {
  text-align: right;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.75rem;
  color: #3d3a36;
  line-height: 1.3;
}

.cdc-pipe-label-num {
  display: block;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #a89b8c;
  margin-bottom: 0.15rem;
}

.cdc-pipe-label-desc {
  display: block;
  font-size: 0.65rem;
  color: #8b7355;
  font-style: italic;
  margin-top: 0.2rem;
  line-height: 1.4;
}

.cdc-pipe-visual {
  display: flex;
  flex-wrap: wrap;
  align-items: center;
  gap: 0.5rem;
  background: rgba(61, 58, 54, 0.025);
  border-radius: 6px;
  padding: 0.6rem 0.75rem;
  min-height: 2.5rem;
}

.cdc-pipe-visual.cdc-pipe-grid {
  display: grid;
  grid-template-columns: repeat(6, 1fr);
  justify-items: center;
}

.cdc-pipe-code {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  line-height: 1.5;
  color: #5a564f;
  border-left: 2px solid rgba(196, 90, 59, 0.3);
  padding-left: 0.6rem;
  white-space: pre;
}

.cdc-pipe-code .kw { color: #8b5cf6; }
.cdc-pipe-code .fn { color: #c45a3b; }
.cdc-pipe-code .cm { color: #a89b8c; font-style: italic; }
.cdc-pipe-code .str { color: #5a8a5a; }

.cdc-pipe-connector {
  display: flex;
  justify-content: center;
  padding: 0.15rem 0;
}

.cdc-pipe-connector::after {
  content: '';
  display: block;
  width: 2px;
  height: 20px;
  background: rgba(61, 58, 54, 0.2);
  position: relative;
}

.cdc-pipe-connector-arrow {
  display: flex;
  justify-content: center;
}

.cdc-pipe-connector-arrow::after {
  content: '';
  display: block;
  width: 0;
  height: 0;
  border-left: 5px solid transparent;
  border-right: 5px solid transparent;
  border-top: 6px solid rgba(61, 58, 54, 0.25);
}

.cdc-pipe-file {
  display: inline-flex;
  align-items: center;
  gap: 0.35rem;
  padding: 0.3rem 0.6rem;
  background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.15);
  border-radius: 4px;
  box-shadow: 0 1px 3px rgba(0, 0, 0, 0.06);
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.7rem;
  color: #3d3a36;
}

.cdc-pipe-file-icon {
  font-size: 0.85rem;
  opacity: 0.6;
}

.cdc-pipe-hash-label {
  display: block;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.55rem;
  color: #8b7355;
  text-align: center;
  margin-top: 0.15rem;
  letter-spacing: -0.02em;
}

.cdc-pipe-chunk-col {
  display: flex;
  flex-direction: column;
  align-items: center;
}

.cdc-pipe-result {
  display: inline-block;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.55rem;
  font-weight: 600;
  padding: 0.1rem 0.35rem;
  border-radius: 3px;
  margin-top: 0.2rem;
  letter-spacing: 0.02em;
  text-transform: uppercase;
}

.cdc-pipe-result.exists {
  background: rgba(90, 138, 90, 0.15);
  color: #3d7a3d;
}

.cdc-pipe-result.new {
  background: rgba(196, 90, 59, 0.15);
  color: #a84832;
}

.cdc-pipe-branch {
  display: flex;
  gap: 1rem;
  width: 100%;
  justify-content: center;
}

.cdc-pipe-branch-arm {
  flex: 1;
  max-width: 14rem;
  padding: 0.6rem 0.75rem;
  border-radius: 6px;
  border: 1px dashed;
  text-align: center;
}

.cdc-pipe-branch-arm .cdc-pipe-branch-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  font-weight: 600;
  margin-bottom: 0.3rem;
}

.cdc-pipe-branch-arm .cdc-pipe-branch-desc {
  font-size: 0.7rem;
  line-height: 1.4;
  color: #5a564f;
}

.cdc-pipe-branch-arm.exists-arm {
  background: rgba(90, 138, 90, 0.06);
  border-color: rgba(90, 138, 90, 0.3);
}

.cdc-pipe-branch-arm.exists-arm .cdc-pipe-branch-title { color: #3d7a3d; }

.cdc-pipe-branch-arm.new-arm {
  background: rgba(196, 90, 59, 0.06);
  border-color: rgba(196, 90, 59, 0.3);
}

.cdc-pipe-branch-arm.new-arm .cdc-pipe-branch-title { color: #a84832; }

.cdc-pipe-summary {
  text-align: center;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.8rem;
  font-weight: 600;
  color: #3d3a36;
  padding: 0.75rem 1.25rem;
  background: linear-gradient(135deg, rgba(196, 90, 59, 0.06) 0%, rgba(212, 165, 116, 0.1) 100%);
  border-radius: 6px;
  margin-top: 0.5rem;
  width: 100%;
  box-sizing: border-box;
}

.cdc-pipe-stream-label {
  font-size: 0.65rem;
  color: #8b7355;
  font-style: italic;
}

/* Pipeline responsive: tablet */
@media (max-width: 50em) {
  .cdc-pipe-stage {
    grid-template-columns: 5rem 1fr;
    grid-template-rows: auto auto;
  }
  .cdc-pipe-code {
    grid-column: 1 / -1;
    margin-top: 0.25rem;
  }
}

/* Pipeline responsive: mobile */
@media (max-width: 42em) {
  .cdc-pipe-stage {
    grid-template-columns: 1fr;
    text-align: center;
  }
  .cdc-pipe-label {
    text-align: center;
  }
  .cdc-pipe-code {
    white-space: pre-wrap;
    word-break: break-all;
    grid-column: 1;
  }
  .cdc-pipe-branch {
    flex-direction: column;
    align-items: center;
  }
  .cdc-pipe-branch-arm {
    max-width: 100%;
    width: 100%;
  }
}

/* Cost Tradeoffs Explorer */
.cost-bars-container {
  display: flex;
  flex-direction: column;
  gap: 0.75rem;
  padding: 0 1.25rem 1rem;
}

.cost-bar-row {
  display: grid;
  grid-template-columns: 7rem 1fr 14rem;
  align-items: center;
  gap: 0.75rem;
}

.cost-bar-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.8rem;
  font-weight: bold;
  color: #3d3a36;
  text-align: right;
}

.cost-bar-track {
  height: 24px;
  background: rgba(61, 58, 54, 0.08);
  border-radius: 4px;
  overflow: hidden;
}

.cost-bar-fill {
  height: 100%;
  border-radius: 4px;
  transition: width 0.2s ease;
}

.cost-fill-cpu {
  background: linear-gradient(to right, #d4a574, #c45a3b);
}

.cost-fill-memory {
  background: linear-gradient(to right, #94b3c8, #5a7d94);
}

.cost-fill-network {
  background: linear-gradient(to right, #8ab88a, #5a8a5a);
}

.cost-fill-storage {
  background: linear-gradient(to right, #d4b878, #b8943b);
}

.cost-bar-annotation {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.75rem;
  color: #8b7355;
  line-height: 1.5;
  min-height: 3em;
}

.cost-bars-axis-label {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #a89b8c;
  text-align: right;
  padding: 0.25rem 1.25rem 0 0;
  display: grid;
  grid-template-columns: 7rem 1fr;
  gap: 0.75rem;
}

.cost-bars-axis-label span:first-child {
  /* empty grid cell to align with bar labels */
}

.cost-bars-axis-note {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  color: #8b7355;
  padding: 0.5rem 1.25rem 0;
  line-height: 1.5;
}

@media (max-width: 50em) {
  .cost-bar-row {
    grid-template-columns: 5rem 1fr 10rem;
  }
}

@media (max-width: 42em) {
  .cost-bar-row {
    grid-template-columns: 1fr;
    gap: 0.25rem;
  }
  .cost-bar-label {
    text-align: left;
  }
  .cost-bar-track {
    height: 20px;
  }
}

/* Cloud Cost Table */
.cost-cloud-section {
  padding: 0 1.25rem 0.5rem;
  border-top: 1px solid rgba(61, 58, 54, 0.1);
  margin-top: 0.5rem;
  padding-top: 1rem;
}

.cost-cloud-header {
  display: flex;
  flex-wrap: wrap;
  justify-content: space-between;
  align-items: baseline;
  gap: 0.5rem;
  margin-bottom: 0.75rem;
}

.cost-cloud-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  font-weight: bold;
  color: #3d3a36;
}

.cost-cloud-workload {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.7rem;
  color: #8b7355;
}

.cost-cloud-table {
  width: 100%;
  border-collapse: collapse;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.75rem;
  margin-bottom: 0.75rem;
}

.cost-cloud-table th {
  font-weight: bold;
  color: #3d3a36;
  text-align: right;
  padding: 0.35rem 0.5rem;
  border-bottom: 2px solid rgba(61, 58, 54, 0.15);
}

.cost-cloud-table th:first-child {
  text-align: left;
}

.cost-cloud-table td {
  padding: 0.35rem 0.5rem;
  text-align: right;
  color: #3d3a36;
  border-bottom: 1px solid rgba(61, 58, 54, 0.07);
}

.cost-cloud-table td:first-child {
  text-align: left;
  color: #8b7355;
}

.cost-cloud-table tr:last-child td {
  font-weight: bold;
  border-top: 2px solid rgba(61, 58, 54, 0.15);
  border-bottom: none;
}

.cost-cell-value {
  display: block;
}

.cost-cell-calc {
  display: block;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #a08b6e;
  line-height: 1.3;
}

.cost-cloud-assumptions {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.65rem;
  color: #8b7355;
  line-height: 1.5;
}

.cost-cloud-assumptions a {
  color: #c45a3b;
}

@media (max-width: 42em) {
  .cost-cloud-table {
    font-size: 0.65rem;
  }
  .cost-cloud-table th,
  .cost-cloud-table td {
    padding: 0.25rem 0.3rem;
  }
  .cost-cell-calc {
    display: none;
  }
}

.cost-pricing-ref {
  margin: 0.5rem 0 0.75rem;
}
.cost-pricing-ref summary {
  cursor: pointer;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  color: #8b7355;
  list-style: none;
  padding: 0.25rem 0;
}
.cost-pricing-ref summary::-webkit-details-marker { display: none; }
.cost-pricing-ref summary::marker { display: none; }
.cost-pricing-ref summary::before {
  content: "\25B8  ";
  display: inline-block;
  transition: transform 0.2s;
}
.cost-pricing-ref[open] summary::before {
  transform: rotate(90deg);
}
.cost-ref-table tr:last-child td {
  font-weight: normal;
  border-top: none;
  border-bottom: 1px solid rgba(61, 58, 54, 0.07);
}

/* Footnotes */
.cdc-footnote-marker {
  font-size: 0.75em;
  vertical-align: super;
  line-height: 0;
}
.cdc-footnote-marker a {
  color: #c45a3b;
  text-decoration: none;
  font-weight: 600;
}
.cdc-footnote-marker a:hover {
  text-decoration: underline;
}
.cdc-footnotes {
  margin-top: 2rem;
  padding-top: 1rem;
  border-top: 1px solid rgba(61, 58, 54, 0.1);
  font-size: 0.85rem;
  color: #5a5550;
  line-height: 1.7;
}
.cdc-footnotes ol {
  padding-left: 1.5rem;
  margin: 0;
}
.cdc-footnotes li {
  margin-bottom: 0.75rem;
}
.cdc-footnotes .back-ref {
  color: #c45a3b;
  text-decoration: none;
  margin-left: 0.25rem;
}
.cdc-footnotes .back-ref:hover {
  text-decoration: underline;
}

/* Cost dimension tabs */
.cdc-cost-tabs {
  margin: 1.5rem 0;
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-radius: 6px;
  background: #fff;
}

.cdc-tab-bar {
  display: flex;
  gap: 0;
  border-bottom: 1px solid rgba(61, 58, 54, 0.1);
}

.cdc-tab {
  background: none;
  border: none;
  padding: 0.6rem 1.25rem;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.9rem;
  color: #6b6560;
  cursor: pointer;
  position: relative;
  transition: color 0.2s;
}

.cdc-tab:hover {
  color: #3d3a36;
}

.cdc-tab.active {
  color: #3d3a36;
  font-weight: 600;
}

.cdc-tab.active::after {
  content: '';
  position: absolute;
  bottom: -1px;
  left: 0;
  right: 0;
  height: 2px;
  background: #c45a3b;
}

.cdc-tab-panels {
  padding: 1.25rem 1.5rem;
}

.cdc-tab-panel {
  display: none;
  line-height: 1.6;
}

.cdc-tab-panel.active {
  display: block;
}

.cdc-tab-panel .cdc-cost-summary {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.88rem;
  font-style: italic;
  color: #6b6560;
  margin-top: 0;
  margin-bottom: 1rem;
}

@media (max-width: 42em) {
  .cdc-tab { padding: 0.5rem 0.75rem; font-size: 0.8rem; }
  .cdc-tab-panels { padding: 1rem; }
}
</style>

<div class="cdc-series-nav">
Part 3 of 5 in a series on Content-Defined Chunking. Previous: <a href="/writings/content-defined-chunking-part-2">Part 2: A Deep Dive into FastCDC</a> · Next: <a href="/writings/content-defined-chunking-part-4">Part 4: CDC in the Cloud</a>
</div>

<p>In <a href="/writings/content-defined-chunking-part-1">Part 1</a>, we explored why content-defined chunking exists and surveyed three algorithm families. In <a href="/writings/content-defined-chunking-part-2">Part 2</a>, we took a deep dive into FastCDC’s Gear hash, normalized chunking, and how average byte targets affect chunk distribution. In this post, we bring the pieces together to see deduplication in action, examine where CDC is used in practice today and where it is not, and explore the cost tradeoffs that shape real-world systems.</p>

<hr />

<h2 id="deduplication-in-action">Deduplication in Action</h2>

<p>Imagine you are building a system to store files that change over time. Each new version is mostly the same as the last, with only a small edit here or there. As we saw in <a href="/writings/content-defined-chunking-part-1">Part 1</a>, storing a complete copy of every version wastes storage on identical content. Content-defined chunking offers a way out: split each version into chunks based on content, fingerprint each chunk, and only store chunks you have not seen before. But what does that actually look like in practice? The explorer below runs FastCDC on editable text. A small edit has already been saved to show deduplication at work: most chunks are recognized as duplicates and never stored again. Try making your own edits to see how content-defined boundaries respond.</p>

<div class="cdc-viz">
<div class="cdc-viz-header">
  <span class="cdc-viz-title">Deduplication Explorer</span>
</div>
<p class="cdc-viz-hint">Click "Save Version" after editing to see which chunks are new and which are shared. Hover over chunks to highlight them across views.</p>
<div class="cdc-dedup-viz" id="dedup-demo">
  <!-- Populated dynamically by VersionedDedupDemo -->
</div>
</div>

<p>As the demo shows, a small edit produces only a handful of new chunks, while the rest are shared across versions. But how does this work under the hood?</p>

<h3 id="the-deduplication-pipeline">The Deduplication Pipeline</h3>

<p>Return to the system that stores files as they change over time, where the goal is to avoid writing identical content twice. Implementations vary widely, but any CDC-based deduplication system needs the same core ingredients: a way to split data into chunks, a way to fingerprint each chunk, and a way to check whether that fingerprint has been seen before. The visualization below walks through these ingredients as a simple linear pipeline, though real systems will likely optimize by reordering, parallelizing, or batching these steps.</p>

<div class="cdc-viz">
<div class="cdc-viz-header">
  <span class="cdc-viz-title">Pipeline</span>
</div>
<div class="cdc-pipe">

  <!-- Stage 01: File Input -->
  <div class="cdc-pipe-stage">
    <div class="cdc-pipe-label">
      <span class="cdc-pipe-label-num">01</span>
      File Input
      <span class="cdc-pipe-label-desc">Read the file as a raw byte stream.</span>
    </div>
    <div class="cdc-pipe-visual">
      <span class="cdc-pipe-file"><span class="cdc-pipe-file-icon">&#128196;</span> document.txt</span>
      <span class="cdc-pipe-stream-label">raw byte stream</span>
    </div>
    <div class="cdc-pipe-code"><span class="kw">let</span> data = <span class="fn">fs::read</span>(<span class="str">"document.txt"</span>)?;</div>
  </div>

  <div class="cdc-pipe-connector"></div>
  <div class="cdc-pipe-connector-arrow"></div>

  <!-- Stage 02: CDC Chunking -->
  <div class="cdc-pipe-stage">
    <div class="cdc-pipe-label">
      <span class="cdc-pipe-label-num">02</span>
      CDC Chunking
      <span class="cdc-pipe-label-desc">Split the stream into variable-size chunks using content-defined boundaries.</span>
    </div>
    <div class="cdc-pipe-visual cdc-pipe-grid">
      <span class="cdc-chunk chunk-a" style="font-size:0.75rem;">A</span>
      <span class="cdc-chunk chunk-b" style="font-size:0.75rem;">B</span>
      <span class="cdc-chunk chunk-c" style="font-size:0.75rem;">C</span>
      <span class="cdc-chunk chunk-d" style="font-size:0.75rem;">D</span>
      <span class="cdc-chunk chunk-e" style="font-size:0.75rem;">E</span>
      <span class="cdc-chunk chunk-new" style="font-size:0.75rem;">F</span>
    </div>
    <div class="cdc-pipe-code"><span class="kw">let</span> chunks = <span class="fn">FastCDC::new</span>(
  &amp;data, min, avg, max
);</div>
  </div>

  <div class="cdc-pipe-connector"></div>
  <div class="cdc-pipe-connector-arrow"></div>

  <!-- Stage 03: Hash Each Chunk -->
  <div class="cdc-pipe-stage">
    <div class="cdc-pipe-label">
      <span class="cdc-pipe-label-num">03</span>
      Hash Each Chunk
      <span class="cdc-pipe-label-desc">Compute a collision-resistant fingerprint for each chunk.</span>
    </div>
    <div class="cdc-pipe-visual cdc-pipe-grid">
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-a" style="font-size:0.75rem; padding: 0.3rem 0.5rem;">A</span>
        <span class="cdc-pipe-hash-label">7f3a9b2c</span>
      </div>
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-b" style="font-size:0.75rem; padding: 0.3rem 0.5rem;">B</span>
        <span class="cdc-pipe-hash-label">e2b10f87</span>
      </div>
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-c" style="font-size:0.75rem; padding: 0.3rem 0.5rem;">C</span>
        <span class="cdc-pipe-hash-label">91cda4e3</span>
      </div>
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-d" style="font-size:0.75rem; padding: 0.3rem 0.5rem;">D</span>
        <span class="cdc-pipe-hash-label">a4f8c61d</span>
      </div>
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-e" style="font-size:0.75rem; padding: 0.3rem 0.5rem;">E</span>
        <span class="cdc-pipe-hash-label">c3d752af</span>
      </div>
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-new" style="font-size:0.75rem; padding: 0.3rem 0.5rem;">F</span>
        <span class="cdc-pipe-hash-label">58eab03e</span>
      </div>
    </div>
    <div class="cdc-pipe-code"><span class="kw">for</span> chunk <span class="kw">in</span> chunks {
  <span class="kw">let</span> hash = <span class="fn">blake3</span>(chunk.data);
}</div>
  </div>

  <div class="cdc-pipe-connector"></div>
  <div class="cdc-pipe-connector-arrow"></div>

  <!-- Stage 04: Store Lookup -->
  <div class="cdc-pipe-stage">
    <div class="cdc-pipe-label">
      <span class="cdc-pipe-label-num">04</span>
      Store Lookup
      <span class="cdc-pipe-label-desc">Check whether each fingerprint already exists in the chunk store.</span>
    </div>
    <div class="cdc-pipe-visual cdc-pipe-grid">
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-a unchanged" style="font-size:0.75rem;">A</span>
        <span class="cdc-pipe-result exists">exists</span>
      </div>
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-b" style="font-size:0.75rem;">B</span>
        <span class="cdc-pipe-result new">new!</span>
      </div>
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-c unchanged" style="font-size:0.75rem;">C</span>
        <span class="cdc-pipe-result exists">exists</span>
      </div>
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-d unchanged" style="font-size:0.75rem;">D</span>
        <span class="cdc-pipe-result exists">exists</span>
      </div>
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-e unchanged" style="font-size:0.75rem;">E</span>
        <span class="cdc-pipe-result exists">exists</span>
      </div>
      <div class="cdc-pipe-chunk-col">
        <span class="cdc-chunk chunk-new" style="font-size:0.75rem;">F</span>
        <span class="cdc-pipe-result new">new!</span>
      </div>
    </div>
    <div class="cdc-pipe-code"><span class="kw">let</span> known = <span class="fn">store.contains</span>(hash);</div>
  </div>

  <div class="cdc-pipe-connector"></div>
  <div class="cdc-pipe-connector-arrow"></div>

  <!-- Stage 05: Store Decision -->
  <div class="cdc-pipe-stage">
    <div class="cdc-pipe-label">
      <span class="cdc-pipe-label-num">05</span>
      Store Decision
      <span class="cdc-pipe-label-desc">Write only new chunks; record a reference for duplicates.</span>
    </div>
    <div class="cdc-pipe-visual" style="background: none; padding: 0;">
      <div class="cdc-pipe-branch">
        <div class="cdc-pipe-branch-arm exists-arm">
          <div class="cdc-pipe-branch-title">Hash Exists</div>
          <div class="cdc-pipe-branch-desc">Skip write, record reference</div>
        </div>
        <div class="cdc-pipe-branch-arm new-arm">
          <div class="cdc-pipe-branch-title">Hash New</div>
          <div class="cdc-pipe-branch-desc">Write chunk, register hash</div>
        </div>
      </div>
    </div>
    <div class="cdc-pipe-code"><span class="kw">if</span> known {
  <span class="fn">add_ref</span>(hash)
} <span class="kw">else</span> {
  <span class="fn">store.put</span>(hash, data)
}</div>
  </div>

</div>
</div>

<p>Walking through the stages: the raw file bytes enter the pipeline and are split into variable-size chunks by the CDC algorithm (here, FastCDC). Each chunk is then fingerprinted with a cryptographic hash like BLAKE3. The system looks up each fingerprint in the chunk store. Chunks that already exist are skipped with only a reference recorded, while genuinely new chunks are written to storage and their fingerprints registered for future lookups. The result is that storing a new version of a file costs only the new chunks, not a full copy.</p>

<p>Each stage in the pipeline maps to just a few lines of code, but together they form a system where redundant data is identified and eliminated before it ever reaches disk or network. When a file changes, only the chunks that were actually modified produce new hashes. The rest match what is already in the store, so they are never written again.</p>
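<p>The five stages can be sketched end to end using only the Rust standard library. This is a minimal illustration under loudly stated assumptions, not how any production tool is built: the chunker is a toy Gear-style loop with tiny size limits and no normalized chunking, the fingerprint uses <code>DefaultHasher</code> as a stand-in for a cryptographic hash such as BLAKE3, and <code>ChunkStore</code>, <code>put_file</code>, and <code>get_file</code> are hypothetical names backed by an in-memory map.</p>

```rust
use std::collections::hash_map::{DefaultHasher, Entry};
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Toy parameters (assumptions for illustration, far smaller than production values).
const MIN_SIZE: usize = 64;
const MAX_SIZE: usize = 1024;
const MASK: u64 = 0xFF << 32; // 8 mask bits: a boundary roughly every 256 bytes past MIN_SIZE

/// Builds a 256-entry Gear-style table from a fixed-seed xorshift generator.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        *entry = state;
    }
    table
}

/// Stage 02: split `data` at content-defined boundaries.
fn chunk(data: &[u8]) -> Vec<&[u8]> {
    let table = gear_table();
    let mut chunks = Vec::new();
    let (mut start, mut hash) = (0usize, 0u64);
    for (i, &byte) in data.iter().enumerate() {
        // Gear update: one shift, one table lookup, one add per byte.
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        let len = i + 1 - start;
        if (len >= MIN_SIZE && (hash & MASK) == 0) || len >= MAX_SIZE {
            chunks.push(&data[start..=i]);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]); // trailing partial chunk
    }
    chunks
}

/// Stage 03: fingerprint a chunk. DefaultHasher is NOT collision-resistant;
/// it stands in for a cryptographic hash such as BLAKE3 to keep this std-only.
fn fingerprint(chunk: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    chunk.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical in-memory chunk store keyed by fingerprint.
#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<u64, Vec<u8>>,
}

impl ChunkStore {
    /// Stages 04 and 05: look up each fingerprint and write only unseen chunks.
    /// Returns the manifest (ordered fingerprints) and the number of new chunks.
    fn put_file(&mut self, data: &[u8]) -> (Vec<u64>, usize) {
        let mut manifest = Vec::new();
        let mut new_chunks = 0;
        for piece in chunk(data) {
            let fp = fingerprint(piece);
            if let Entry::Vacant(slot) = self.chunks.entry(fp) {
                slot.insert(piece.to_vec());
                new_chunks += 1;
            }
            manifest.push(fp);
        }
        (manifest, new_chunks)
    }

    /// Rebuilds a file by concatenating its chunks in manifest order.
    fn get_file(&self, manifest: &[u64]) -> Vec<u8> {
        let mut out = Vec::new();
        for fp in manifest {
            out.extend_from_slice(&self.chunks[fp]);
        }
        out
    }
}
```

<p>Calling <code>put_file</code> twice with the same bytes reports zero new chunks the second time, and saving a lightly edited copy writes only the chunks whose contents changed. The manifest is what a real system would persist as the version record.</p>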

<h3 id="the-core-cost-tradeoffs">The Core Cost Tradeoffs</h3>

<p>Deduplication is not free. Every stage of the pipeline above consumes resources, and the central engineering challenge is deciding where to spend and where to save.<span class="cdc-cite"><a href="#ref-15">[15]</a></span> The core CDC costs fall into four categories, and they all interact.</p>

<div class="cdc-cost-tabs" id="cost-tabs">
  <div class="cdc-tab-bar" role="tablist">
    <button class="cdc-tab active" role="tab" aria-selected="true" data-tab="cpu">CPU</button>
    <button class="cdc-tab" role="tab" aria-selected="false" data-tab="memory">Memory</button>
    <button class="cdc-tab" role="tab" aria-selected="false" data-tab="network">Network</button>
    <button class="cdc-tab" role="tab" aria-selected="false" data-tab="storage">Storage</button>
  </div>
  <div class="cdc-tab-panels">
  <div class="cdc-tab-panel active" id="tab-cpu" role="tabpanel">
    <p class="cdc-cost-summary"><em>Hashing, compression, and chunking</em></p>
    <p>CPU is the first cost you pay, and it shows up in three places: the rolling hash that finds chunk boundaries, the cryptographic hash that fingerprints each chunk, and the compression that shrinks chunks before storage.</p>
    <p>The rolling hash itself is cheap: as we saw in <a href="/writings/content-defined-chunking-part-2.html">Part 2</a>, Gear hash processes each byte with just a shift, a table lookup, and an add. The cryptographic hash that follows is usually the more expensive step. Hashes such as SHA-256 and BLAKE3 must process every byte of every chunk to produce a collision-resistant fingerprint.<span id="fn1-ref" class="cdc-footnote-marker"><a href="#fn1">1</a></span> With fast chunking algorithms like FastCDC, fingerprinting dominates the CPU profile of the pipeline.<span class="cdc-cite"><a href="#ref-17">[17]</a></span> Stronger hashes cost more cycles but reduce the probability of two different chunks sharing the same hash to effectively zero.</p>
    <p>Then there is compression: most production systems (Restic, Borg, and others) compress each chunk before storing it, typically with zstd or LZ4. Compression adds meaningful CPU cost on writes and a smaller cost on reads (decompression), but it can dramatically reduce the bytes that actually hit disk and network.</p>
    <p>All three costs scale linearly with data volume. In practice, BLAKE3 is fast enough that hashing rarely bottlenecks a modern pipeline, and modern compressors like zstd offer tunable speed-vs-ratio tradeoffs, but both represent real work on every byte that enters the system. Systems whose chunks have predictable internal structure can push further: Meta's <a href="https://openzl.org/">OpenZL</a> generates compressors tailored to a specific data format, achieving better compression ratios at higher speeds than general-purpose tools can manage.<span class="cdc-cite"><a href="#ref-22">[22]</a></span></p>
  </div>
  <div class="cdc-tab-panel" id="tab-memory" role="tabpanel">
    <p class="cdc-cost-summary"><em>Chunk index lookups at scale</em></p>
    <p>Memory is where the chunk index lives. The content-addressable store needs a searchable mapping from hash to storage location, and that index must be fast to query because every incoming chunk triggers a lookup. At scale, keeping a full chunk index in RAM becomes impractical, and a disk-based index with one seek per incoming chunk is far too slow.<span class="cdc-cite"><a href="#ref-16">[16]</a></span><span class="cdc-cite"><a href="#ref-18">[18]</a></span></p>
    <p>The primary cost driver is chunk count. The index size scales with the number of unique chunks, not with total data volume, which is good. But smaller average chunk sizes mean more chunks per file, which means a larger index. A system with 4 KB average chunks will produce roughly four times as many index entries as one with 16 KB chunks for the same data.</p>
    <p>Once the index outgrows a single machine, or needs to be shared across a fleet, it becomes a distributed systems problem: you need a persistent, highly available data store (typically a database or distributed key-value system) to hold the mapping and serve lookups at low latency. That infrastructure has its own operational cost, and it scales with chunk count.</p>
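    <p>A back-of-the-envelope sizing makes the chunk-count effect concrete. The sketch below assumes each index entry holds a 32-byte fingerprint plus an 8-byte storage location; the location size and both function names are illustrative assumptions, and real indexes add hash-table and allocator overhead on top.</p>

```rust
/// Bytes per index entry: a 32-byte fingerprint plus an assumed 8-byte storage location.
const ENTRY_BYTES: u64 = 32 + 8;

/// Number of index entries needed for `unique_bytes` of deduplicated data.
fn index_entries(unique_bytes: u64, avg_chunk_bytes: u64) -> u64 {
    unique_bytes / avg_chunk_bytes
}

/// Approximate index size in bytes, ignoring load factor and allocator overhead.
fn index_bytes(unique_bytes: u64, avg_chunk_bytes: u64) -> u64 {
    index_entries(unique_bytes, avg_chunk_bytes) * ENTRY_BYTES
}
```

    <p>Under these assumptions, 1 TiB of unique data at a 4 KiB average needs 268,435,456 entries (about 10 GiB of index), while a 16 KiB average needs 67,108,864 entries (about 2.5 GiB).</p>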
  </div>
  <div class="cdc-tab-panel" id="tab-network" role="tabpanel">
    <p class="cdc-cost-summary"><em>Transfer efficiency through deduplication</em></p>
    <p>Network is often where deduplication pays for itself most visibly. In distributed systems (backup to a remote server, syncing across devices), only new chunks need to traverse the wire. The primary cost driver is the dedup ratio: the fraction of chunks that already exist at the destination and never need to be sent.</p>
    <p>The <a href="https://pdos.csail.mit.edu/archive/lbfs/">Low-Bandwidth Network File System</a> (LBFS) demonstrated this early on, consuming over an order of magnitude less bandwidth than traditional network file systems by transmitting only chunks not already present at the receiver.<span class="cdc-cite"><a href="#ref-19">[19]</a></span> If you edit a paragraph in a 10 MB document and the system produces 200 chunks, perhaps only 3 of those are new. At an average of 50 KB per chunk, that is a transfer of roughly 150 KB instead of 10 MB.</p>
    <p>Smaller chunks generally improve this ratio because edits are less likely to span an entire small chunk, but each chunk also carries metadata overhead (its hash, its length, its position in the manifest), so there is a point of diminishing returns.</p>
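    <p>The arithmetic behind that example, including the per-chunk metadata, can be written down directly. This is a rough estimate under assumptions: chunks are treated as uniformly sized, the full manifest is assumed to be sent with every version, and <code>transfer_bytes</code> and its 40-byte per-chunk metadata figure are hypothetical.</p>

```rust
/// Estimates bytes sent when only new chunks cross the wire.
/// `meta_per_chunk` is an assumed manifest cost per chunk (hash, length, position).
fn transfer_bytes(file_bytes: u64, total_chunks: u64, new_chunks: u64, meta_per_chunk: u64) -> u64 {
    if total_chunks == 0 {
        return 0; // nothing to send for an empty file
    }
    let avg_chunk = file_bytes / total_chunks;
    new_chunks * avg_chunk + total_chunks * meta_per_chunk
}
```

    <p>For the 10 MB document above, <code>transfer_bytes(10_000_000, 200, 3, 40)</code> comes to 158,000 bytes: 150,000 bytes of new chunk data plus 8,000 bytes of manifest. Halving the chunk size doubles the manifest term, which is where the diminishing returns come from.</p>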
  </div>
  <div class="cdc-tab-panel" id="tab-storage" role="tabpanel">
    <p class="cdc-cost-summary"><em>Unique chunks plus metadata overhead</em></p>
    <p>Storage (disk or object store) holds the unique chunks plus all the metadata that lets you reconstruct files from them: hashes, chunk-to-file mappings, version manifests. The primary cost driver is the balance between dedup savings and metadata overhead. Smaller chunks improve deduplication (more sharing opportunities), but they also increase the metadata-to-data ratio.<span class="cdc-cite"><a href="#ref-21">[21]</a></span> On cloud object stores, chunk count also drives API operation costs: every PUT and GET is priced per request, so more chunks means more billable operations (Part 4 explores how containers address this).</p>
    <p>At extremely small chunk sizes (say, 256 bytes), the overhead of storing a 32-byte hash and associated bookkeeping for each chunk becomes a significant fraction of the chunk itself: 12.5% for the hash alone.</p>
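    <p>The same ratio can be computed for any chunk size. The helper below is a hypothetical one-liner that counts only fixed per-chunk metadata, such as the 32-byte hash, and ignores manifests and container formats.</p>

```rust
/// Fraction of extra bytes stored per chunk as fixed metadata.
fn metadata_overhead(avg_chunk_bytes: u32, meta_bytes_per_chunk: u32) -> f64 {
    f64::from(meta_bytes_per_chunk) / f64::from(avg_chunk_bytes)
}
```

    <p>With a 32-byte hash, 256-byte chunks carry 12.5% overhead, while 16 KiB chunks carry under 0.2%.</p>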
    <p>Meyer and Bolosky found that for live desktop file systems, where most duplication consists of identical files stored in multiple locations, whole-file deduplication already captures roughly 75% of the savings of fine-grained block-level dedup.<span class="cdc-cite"><a href="#ref-20">[20]</a></span> But that result is workload-dependent. When files churn frequently and edits are localized within larger files (the pattern that dominates backup, sync, and software distribution), whole-file dedup sees zero savings on each modified file while CDC captures nearly everything. The value of sub-file chunking scales with both how much duplicated content exists and how frequently that content changes.</p>
  </div>
  </div>
</div>

<div class="cdc-callout" data-label="The Central Knob">
Average chunk size is the single parameter that ties all four costs together.<span class="cdc-cite"><a href="#ref-15">[15]</a></span><span class="cdc-cite"><a href="#ref-21">[21]</a></span> Turning it down (smaller chunks) improves deduplication ratio and network efficiency but increases CPU work, index memory, and metadata overhead. Turning it up (larger chunks) reduces overhead but sacrifices dedup granularity. The right setting depends on your domain.
</div>

<p>The explorer below visualizes these four dimensions as you move the chunk size slider. Small chunks push CPU, memory, and metadata overhead up while improving deduplication and network efficiency. Large chunks do the reverse. The sweet spot depends on your workload.</p>

<div class="cdc-viz" id="cost-tradeoffs-demo">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Cost Tradeoffs Explorer</span>
  </div>
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Average Chunk Size: <strong id="cost-tradeoffs-slider-value">32 KB</strong>
    </span>
    <input type="range" id="cost-tradeoffs-slider" min="0" max="100" value="50" step="1" />
    <span class="cdc-viz-hint" style="flex-basis: 100%;">Drag the slider to see how average chunk size affects each cost dimension.</span>
  </div>
  <div class="cost-bars-axis-label">
    <span></span>
    <span>Relative Pressure &rarr;</span>
  </div>
  <div class="cost-bars-container" id="cost-tradeoffs-bars">
  </div>
  <div class="cost-bars-axis-note">
    These bars show the <em>direction</em> and <em>shape</em> of each tradeoff, not exact magnitudes. CPU and memory costs scale with chunk count (more chunks = more per-chunk hashing and lookup overhead, larger index). Network cost decreases with smaller chunks because the higher deduplication ratio means less unique data to transfer. Storage has a U-shape: very small chunks incur metadata overhead, while very large chunks reduce deduplication and store more redundant data.
  </div>
</div>

<p>If you experimented with the <a href="/writings/content-defined-chunking-part-2#parametric-demo">Parametric Chunking Explorer</a> in Part 2, you saw this tradeoff firsthand: smaller average sizes produced more chunks with tighter size distributions, while larger averages produced fewer, more variable chunks. Those demos showed the statistical effect. In production, the right balance depends on your workload: the volume of duplicated content, the rate at which that content changes, and how each cost dimension (CPU, memory, network, storage) maps onto your constraints. These are the competing forces that determine whether CDC is a valuable strategy for your system, and if so, what average chunk size best balances them. That answer depends on your domain, and only you as the expert in your particular system can make that call. My hope is that the intuitions developed here help you make a more informed decision.</p>

<h3 id="where-cdc-lives-today">Where CDC Lives Today</h3>

<p>Content-defined chunking has become infrastructure, often invisible but always essential. It shows up across three broad categories: backup and archival, file sync and distribution, and content-addressable storage.</p>

<p><strong>Backup and archival tools</strong> were the earliest adopters. <strong><a href="https://restic.net/">Restic</a></strong> uses Rabin fingerprints with configurable chunk sizes,<span class="cdc-cite"><a href="#ref-29">[29]</a></span> <strong><a href="https://www.borgbackup.org/">Borg</a></strong> uses Buzhash with a secret seed (preventing attackers from predicting chunk boundaries based on known content),<span class="cdc-cite"><a href="#ref-30">[30]</a></span> and newer tools like <strong><a href="https://kopia.io/">Kopia</a></strong>,<span class="cdc-cite"><a href="#ref-31">[31]</a></span> <strong><a href="https://duplicacy.com/">Duplicacy</a></strong>,<span class="cdc-cite"><a href="#ref-32">[32]</a></span> <strong><a href="https://bupstash.io/">Bupstash</a></strong>,<span class="cdc-cite"><a href="#ref-33">[33]</a></span> and <strong><a href="https://www.tarsnap.com/">Tarsnap</a></strong><span class="cdc-cite"><a href="#ref-34">[34]</a></span> all rely on CDC to deduplicate across snapshots. The pattern is the same in each: split data into content-defined chunks, fingerprint each chunk, and store only the unique ones.</p>
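
<p>The split, fingerprint, and store-unique pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of any tool named above: <code class="language-plaintext highlighter-rouge">GEAR</code>, <code class="language-plaintext highlighter-rouge">chunk</code>, and <code class="language-plaintext highlighter-rouge">ChunkStore</code> are hypothetical names, and the toy Gear-style rolling hash with tiny size limits stands in for the Rabin, Buzhash, or FastCDC chunkers those tools actually use.</p>

```python
import hashlib

# Hypothetical lookup table: 256 pseudo-random 32-bit values, derived
# deterministically so the sketch is reproducible. Real tools ship their own.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
        for i in range(256)]


def chunk(data: bytes, mask: int = 0x3F, min_size: int = 16, max_size: int = 256):
    """Yield content-defined chunks using a toy Gear-style rolling hash."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        # Cut when the hash matches the mask (after min_size) or at max_size.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk


class ChunkStore:
    """In-memory stand-in for a deduplicating backend (illustrative only)."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def put(self, data: bytes) -> list:
        """Store data and return its recipe: the ordered fingerprints."""
        recipe = []
        for c in chunk(data):
            fp = hashlib.sha256(c).hexdigest()
            self.chunks.setdefault(fp, c)  # only unseen chunks are stored
            recipe.append(fp)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Reassemble the original bytes from a recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)
```

<p>Calling <code class="language-plaintext highlighter-rouge">put</code> twice with the same bytes returns the same recipe and leaves the number of stored chunks unchanged, which is the deduplication property each of these tools depends on.</p>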

<p><strong>File sync and software distribution</strong> use CDC to minimize transfer sizes. <strong><a href="https://technology.riotgames.com/news/supercharging-data-delivery-new-league-patcher">Riot Games</a></strong> rebuilt the League of Legends patcher around FastCDC, replacing an older binary-delta system and achieving a tenfold improvement in patching speeds.<span class="cdc-cite"><a href="#ref-27">[27]</a></span> <strong><a href="https://github.com/systemd/casync">casync</a></strong>, created by Lennart Poettering, applies CDC to OS and container image distribution, chunking across file boundaries so that updates to a filesystem image only transfer the chunks that actually changed.<span class="cdc-cite"><a href="#ref-35">[35]</a></span></p>

<p><strong>Content-addressable storage</strong> systems like <strong><a href="https://ipfs.tech/">IPFS</a></strong> use CDC to split files into variable-size blocks before distributing them across a peer-to-peer network.<span class="cdc-cite"><a href="#ref-28">[28]</a></span> Because chunk boundaries are determined by content rather than position, identical regions of different files naturally converge on the same chunks and the same content addresses.</p>

<h3 id="when-cdc-is-not-the-right-choice">When CDC Is Not the Right Choice</h3>

<p>Not every system chooses CDC, and the cost tradeoffs help explain why. CDC optimizes for one thing above all: stable chunk boundaries across edits. That stability enables fine-grained deduplication, but it comes at a cost, and not every application prioritizes deduplication over other concerns.</p>

<p>Dropbox is the most prominent example. Their architecture uses fixed-size 4 MiB blocks with SHA-256 hashing, and has since the early days of the product.<span class="cdc-cite"><a href="#ref-23">[23]</a></span> Dropbox’s primary engineering challenge was not deduplication but <em>transport</em>: syncing files across hundreds of millions of devices as fast as possible while keeping infrastructure costs predictable.</p>

<p>Fixed-size blocks give Dropbox properties that CDC cannot. Block <em>N</em> always starts at offset <code class="language-plaintext highlighter-rouge">N * 4 MiB</code>, so a client can request any block without first receiving a boundary list. Upload work can be split across threads by byte offset with zero coordination, because boundaries are known before the content is read. The receiver knows when each block ends, enabling Dropbox’s streaming sync architecture where downloads begin before the upload finishes, achieving up to 2x improvement on large file sync.<span class="cdc-cite"><a href="#ref-23">[23]</a></span> And because every block is exactly 4 MiB (except the last), memory allocation, I/O scheduling, and storage alignment are all simple to model and predict at scale.</p>
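
<p>The offset arithmetic is simple enough to show directly. The Python sketch below is not Dropbox's code; <code class="language-plaintext highlighter-rouge">block_count</code>, <code class="language-plaintext highlighter-rouge">block_range</code>, and <code class="language-plaintext highlighter-rouge">assign_blocks</code> are hypothetical helpers, and the round-robin assignment is just one plausible way to split work. The point is that every value is computed from the file size alone, before any content is read.</p>

```python
BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, the block size cited above


def block_count(file_size: int) -> int:
    """Number of fixed-size blocks needed to cover file_size bytes."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE


def block_range(n: int, file_size: int) -> tuple:
    """Half-open byte range [start, end) of block n."""
    start = n * BLOCK_SIZE
    if n < 0 or start >= file_size:
        raise IndexError(f"block {n} is out of range")
    # Every block is exactly BLOCK_SIZE long except possibly the last.
    return start, min(start + BLOCK_SIZE, file_size)


def assign_blocks(file_size: int, workers: int) -> list:
    """Round-robin block indices across workers with no coordination."""
    total = block_count(file_size)
    return [list(range(w, total, workers)) for w in range(workers)]
```

<p>For a 10 MiB file this gives three blocks, the last covering bytes 8 MiB through 10 MiB. A CDC chunker cannot answer the same questions without first scanning the content.</p>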

<p>There is also the metadata question. CDC’s chunk index must be backed by a persistent, highly available data store once it outgrows a single machine. For Dropbox, serving hundreds of millions of users, the difference between a fixed-size block index and a variable-size CDC chunk index is not just memory; it is the size and complexity of the metadata infrastructure required to support it. Fixed-size blocks produce fewer, more predictable index entries, which simplifies that infrastructure considerably.</p>

<p>The tradeoff is real. The QuickSync study found that a minor edit in Dropbox can generate sync traffic 10x the size of the actual modification, because an insertion shifts all subsequent content relative to the fixed block boundaries, so every block after the edit hashes differently.<span class="cdc-cite"><a href="#ref-25">[25]</a></span> This is precisely the boundary-shift problem that CDC was designed to solve, as we explored in <a href="/writings/content-defined-chunking-part-1">Part 1</a>. But Dropbox chose to absorb that cost and compensate elsewhere: their Broccoli compression encoder achieves ~33% upload bandwidth savings,<span class="cdc-cite"><a href="#ref-24">[24]</a></span> and the streaming sync architecture pipelines work so effectively that the extra bytes matter less than they otherwise would.</p>
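
<p>A toy reproduction makes the amplification concrete. This Python sketch assumes a 64-byte block in place of 4 MiB and synthetic input, so it does not reproduce the QuickSync measurement itself; the helper names are hypothetical. It counts how many blocks a client would need to upload after an in-place overwrite versus a one-byte insertion.</p>

```python
import hashlib

BLOCK = 64  # toy block size for illustration; the real system uses 4 MiB


def fixed_blocks(data: bytes, size: int = BLOCK) -> list:
    """Split data at fixed byte offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def blocks_to_upload(old: bytes, new: bytes) -> int:
    """Count blocks of `new` whose fingerprint is absent from `old`."""
    known = {hashlib.sha256(b).hexdigest() for b in fixed_blocks(old)}
    return sum(1 for b in fixed_blocks(new)
               if hashlib.sha256(b).hexdigest() not in known)


# 4096 bytes of non-repeating content: 1024 big-endian counters.
original = b"".join(i.to_bytes(4, "big") for i in range(1024))

overwritten = b"X" + original[1:]  # same length, one byte changed
inserted = b"X" + original         # one byte longer, everything shifted

print(blocks_to_upload(original, overwritten))  # 1
print(blocks_to_upload(original, inserted))     # 65
```

<p>The overwrite touches one block. The insertion invalidates all 64 existing blocks and adds a one-byte tail block, even though both edits changed a single byte.</p>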

<p>In short, Dropbox traded storage efficiency for transport speed and operational simplicity. Fixed-size blocks mean a predictable object count, making it straightforward to forecast API costs on cloud storage where every PUT and GET is a line item. The ability to parallelize everything without content-dependent coordination was worth more than the deduplication gains CDC would have provided.</p>

<p><strong>Seafile</strong>, an open-source file sync platform, takes the opposite approach: it uses Rabin fingerprint-based CDC with ~1 MB average chunks to achieve block-level deduplication across file versions and libraries.<span class="cdc-cite"><a href="#ref-26">[26]</a></span> Where Dropbox chose to optimize purely for transport, Seafile shows that CDC-based sync systems can work in practice. <a href="/writings/content-defined-chunking-part-4">Part 4</a> explores how the container abstraction makes this economically viable.</p>

<h3 id="why-cloud-storage-is-the-cost-that-matters">Why Cloud Storage Is the Cost That Matters</h3>

<p>The four dimensions above assume a local or self-managed storage backend where the cost of writing and reading an object is just disk I/O. Most production systems today do not work that way. They store chunks on cloud object storage (S3, GCS, or Azure Blob Storage), and cloud providers turn every one of those engineering costs into a line item on a bill.</p>

<p>Cloud providers charge not just per GB stored but also per API operation. Every PUT and every GET has a price. When each chunk is its own object, the number of API calls scales with the number of chunks, and that operations cost can dominate the bill entirely. The same knob that the explorer above illustrates (smaller chunks improve dedup but increase chunk count) takes on a new, financially painful dimension: more chunks mean more API calls, and more API calls mean a larger cloud bill, even if the total bytes stored are fewer.</p>
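
<p>A back-of-the-envelope calculation shows how quickly the request count grows. In this Python sketch the function names are hypothetical and the per-request price is a placeholder chosen for illustration, not a quoted rate from any provider; substitute your provider's current pricing. It assumes one object per chunk and that every chunk is unique.</p>

```python
def object_count(total_bytes: int, avg_chunk_bytes: int) -> int:
    """Objects written when each chunk is stored as its own object."""
    return -(-total_bytes // avg_chunk_bytes)  # ceiling division


def put_cost(total_bytes: int, avg_chunk_bytes: int,
             price_per_1000_puts: float) -> float:
    """Request cost of the initial upload, in the price's currency."""
    return object_count(total_bytes, avg_chunk_bytes) * price_per_1000_puts / 1000


TIB = 2 ** 40
KIB = 2 ** 10
PLACEHOLDER_PRICE = 0.005  # assumed price per 1,000 PUTs; illustrative only

for avg in (8 * KIB, 64 * KIB):
    count = object_count(TIB, avg)
    cost = put_cost(TIB, avg, PLACEHOLDER_PRICE)
    print(f"{avg // KIB:>3} KiB chunks: {count:>11,} PUTs, {cost:,.2f} in request fees")
```

<p>Under these assumptions, moving from 64 KiB to 8 KiB average chunks multiplies the PUT count for 1 TiB of unique data by eight, from about 16.8 million to about 134 million requests, and the request fees scale with it.</p>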

<p>This is the problem that <a href="/writings/content-defined-chunking-part-4">Part 4: CDC in the Cloud</a> tackles head-on. Grouping chunks into larger, fixed-size containers collapses the object count and makes CDC viable on cloud storage. But containers introduce their own challenges: fragmentation, garbage collection complexity, and restore performance degradation. <a href="/writings/content-defined-chunking-part-5">Part 5</a> then takes a deep dive into the full cost picture, exploring how different storage providers, caching layers, and container configurations combine to determine the real monthly bill.</p>
<div class="cdc-footnotes">
<ol>
<li id="fn1">Collision resistance requires that it is computationally infeasible to find two different inputs that produce the same hash. For this guarantee to hold, every bit of the input must influence the output. If the function skipped even a single byte, two inputs differing only in that byte would hash identically, a trivial collision. This is the fundamental difference from rolling hashes used for boundary detection: Gear hash only looks at a sliding window and is not collision-resistant, which is fine for finding chunk boundaries but not for content addressing, where a collision means two different chunks are treated as identical and one gets silently discarded. BLAKE3 is notably faster than SHA-256 here because it uses a Merkle tree structure internally, allowing parts of the input to be hashed in parallel across cores and SIMD lanes, but it still processes every byte. <a href="#fn1-ref" class="back-ref">&#8617;</a></li>
</ol>
</div>

<h3 id="references">References</h3>

<div class="cdc-references">

<div class="bib-entry" id="ref-15">
  <div class="bib-number">[15]</div>
  <div class="bib-citation">W. Xia, H. Jiang, D. Feng, F. Douglis, P. Shilane, Y. Hua, M. Fu, Y. Zhang &amp; Y. Zhou, "A Comprehensive Study of the Past, Present, and Future of Data Deduplication," <em>Proceedings of the IEEE</em>, vol. 104, no. 9, pp. 1681-1710, September 2016.</div>
  <div class="bib-links">
    <a href="https://ieeexplore.ieee.org/document/7529062" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE</a>
  </div>
</div>

<div class="bib-entry" id="ref-16">
  <div class="bib-number">[16]</div>
  <div class="bib-citation">B. Zhu, K. Li &amp; H. Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," <em>6th USENIX Conference on File and Storage Technologies (FAST '08)</em>, San Jose, CA, February 2008.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/legacy/event/fast08/tech/full_papers/zhu/zhu.pdf" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> PDF</a>
  </div>
</div>

<div class="bib-entry" id="ref-17">
  <div class="bib-number">[17]</div>
  <div class="bib-citation">W. Xia, X. Zou, Y. Zhou, H. Jiang, C. Liu, D. Feng, Y. Hua, Y. Hu &amp; Y. Zhang, "The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems," <em>IEEE Transactions on Parallel and Distributed Systems</em>, vol. 31, no. 9, pp. 2017-2031, 2020.</div>
  <div class="bib-links">
    <a href="https://csyhua.github.io/csyhua/hua-tpds2020-dedup.pdf" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> PDF</a>
  </div>
</div>

<div class="bib-entry" id="ref-18">
  <div class="bib-number">[18]</div>
  <div class="bib-citation">M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise &amp; P. Camble, "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality," <em>7th USENIX Conference on File and Storage Technologies (FAST '09)</em>, San Jose, CA, February 2009.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/conference/fast-09/sparse-indexing-large-scale-inline-deduplication-using-sampling-and-locality" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

<div class="bib-entry" id="ref-19">
  <div class="bib-number">[19]</div>
  <div class="bib-citation">A. Muthitacharoen, B. Chen &amp; D. Mazi&egrave;res, "A Low-bandwidth Network File System," <em>18th ACM Symposium on Operating Systems Principles (SOSP '01)</em>, Banff, Canada, October 2001.</div>
  <div class="bib-links">
    <a href="https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> PDF</a>
  </div>
</div>

<div class="bib-entry" id="ref-20">
  <div class="bib-number">[20]</div>
  <div class="bib-citation">D. T. Meyer &amp; W. J. Bolosky, "A Study of Practical Deduplication," <em>9th USENIX Conference on File and Storage Technologies (FAST '11)</em>, San Jose, CA, February 2011.</div>
  <div class="bib-links">
    <a href="https://www.usenix.org/legacy/event/fast11/tech/full_papers/Meyer.pdf" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> PDF</a>
  </div>
</div>

<div class="bib-entry" id="ref-21">
  <div class="bib-number">[21]</div>
  <div class="bib-citation">H. Wu, C. Wang, K. Lu, Y. Fu &amp; L. Zhu, "One Size Does Not Fit All: The Case for Chunking Configuration in Backup Deduplication," <em>18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '18)</em>, 2018.</div>
  <div class="bib-links">
    <a href="https://ieeexplore.ieee.org/document/8411025" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE</a>
  </div>
</div>

<div class="bib-entry" id="ref-22">
  <div class="bib-number">[22]</div>
  <div class="bib-citation">Y. Collet, N. Terrell, W. F. Handte, D. Rozenblit, V. Zhang, K. Zhang, Y. Goldschlag, J. Lee, E. Gorokhovsky, Y. Komornik, D. Riegel, S. Angelov &amp; N. Rotem, "OpenZL: A Graph-Based Model for Compression," <em>arXiv:2510.03203</em>, October 2025.</div>
  <div class="bib-links">
    <a href="https://arxiv.org/abs/2510.03203" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> arXiv</a>
    <a href="https://openzl.org/" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Project</a>
  </div>
</div>

<div class="bib-entry" id="ref-23">
  <div class="bib-number">[23]</div>
  <div class="bib-citation">N. Koorapati, "Streaming File Synchronization," <em>Dropbox Tech Blog</em>, July 2014.</div>
  <div class="bib-links">
    <a href="https://dropbox.tech/infrastructure/streaming-file-synchronization" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Blog</a>
  </div>
</div>

<div class="bib-entry" id="ref-24">
  <div class="bib-number">[24]</div>
  <div class="bib-citation">R. Jain &amp; D. R. Horn, "Broccoli: Syncing Faster by Syncing Less," <em>Dropbox Tech Blog</em>, August 2020.</div>
  <div class="bib-links">
    <a href="https://dropbox.tech/infrastructure/-broccoli--syncing-faster-by-syncing-less" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Blog</a>
  </div>
</div>

<div class="bib-entry" id="ref-25">
  <div class="bib-number">[25]</div>
  <div class="bib-citation">Y. Cui, Z. Lai, N. Dai &amp; X. Wang, "QuickSync: Improving Synchronization Efficiency for Mobile Cloud Storage Services," <em>IEEE Transactions on Mobile Computing</em>, vol. 16, no. 12, pp. 3513-3526, 2017.</div>
  <div class="bib-links">
    <a href="https://ieeexplore.ieee.org/document/7898362" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE</a>
  </div>
</div>

<div class="bib-entry" id="ref-26">
  <div class="bib-number">[26]</div>
  <div class="bib-citation">Seafile Ltd., "Data Model," <em>Seafile Administration Manual</em>. CDC implementation: <a href="https://github.com/haiwen/seafile-server/blob/master/common/cdc/cdc.c">seafile-server/common/cdc/cdc.c</a>.</div>
  <div class="bib-links">
    <a href="https://manual.seafile.com/latest/develop/data_model/" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Docs</a>
    <a href="https://github.com/haiwen/seafile-server" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> GitHub</a>
  </div>
</div>

<div class="bib-entry" id="ref-27">
  <div class="bib-number">[27]</div>
  <div class="bib-citation">Riot Games Technology, "Supercharging Data Delivery: The New League Patcher," 2019. Describes the move from binary deltas to FastCDC-based content-defined chunking for game updates.</div>
  <div class="bib-links">
    <a href="https://technology.riotgames.com/news/supercharging-data-delivery-new-league-patcher" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Article</a>
  </div>
</div>

<div class="bib-entry" id="ref-28">
  <div class="bib-number">[28]</div>
  <div class="bib-citation">IPFS Documentation, "File Systems: Chunking." Describes IPFS's use of Rabin fingerprinting for content-defined chunking alongside fixed-size splitting.</div>
  <div class="bib-links">
    <a href="https://docs.ipfs.tech/concepts/file-systems/" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Docs</a>
    <a href="https://github.com/ipfs/boxo/tree/main/chunker" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> GitHub</a>
  </div>
</div>

<div class="bib-entry" id="ref-29">
  <div class="bib-number">[29]</div>
  <div class="bib-citation">A. Neumann, "Restic Foundation - Content Defined Chunking," <em>Restic Blog</em>, September 2015. Describes Restic's use of Rabin fingerprint-based CDC for deduplication.</div>
  <div class="bib-links">
    <a href="https://restic.net/blog/2015-09-12/restic-foundation1-cdc/" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Blog</a>
    <a href="https://github.com/restic/chunker" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> GitHub</a>
  </div>
</div>

<div class="bib-entry" id="ref-30">
  <div class="bib-number">[30]</div>
  <div class="bib-citation">BorgBackup Contributors, "Internals: Data Structures," <em>BorgBackup Documentation</em>. Describes Borg's Buzhash-based chunker with a keyed seed for chunk boundary detection.</div>
  <div class="bib-links">
    <a href="https://borgbackup.readthedocs.io/en/stable/internals/data-structures.html" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Docs</a>
    <a href="https://github.com/borgbackup/borg" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> GitHub</a>
  </div>
</div>

<div class="bib-entry" id="ref-31">
  <div class="bib-number">[31]</div>
  <div class="bib-citation">Kopia Contributors, "Architecture," <em>Kopia Documentation</em>. Describes Kopia's rolling-hash file splitting for content-addressable deduplication.</div>
  <div class="bib-links">
    <a href="https://kopia.io/docs/advanced/architecture/" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Docs</a>
    <a href="https://github.com/kopia/kopia" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> GitHub</a>
  </div>
</div>

<div class="bib-entry" id="ref-32">
  <div class="bib-number">[32]</div>
  <div class="bib-citation">G. Chen, "Variable-size Chunking," <em>Duplicacy Design Document</em>. Describes Duplicacy's variable-size CDC with configurable minimum, average, and maximum chunk sizes.</div>
  <div class="bib-links">
    <a href="https://github.com/gilbertchen/duplicacy/wiki/Chunk-Size" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Wiki</a>
    <a href="https://github.com/gilbertchen/duplicacy" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> GitHub</a>
  </div>
</div>

<div class="bib-entry" id="ref-33">
  <div class="bib-number">[33]</div>
  <div class="bib-citation">A. Chambers, "Technical Overview," <em>Bupstash Documentation</em>. Describes Bupstash's GearHash-based content-defined chunking for encrypted, deduplicated backups.</div>
  <div class="bib-links">
    <a href="https://github.com/andrewchambers/bupstash/blob/master/doc/technical_overview.md" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Docs</a>
    <a href="https://github.com/andrewchambers/bupstash" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> GitHub</a>
  </div>
</div>

<div class="bib-entry" id="ref-34">
  <div class="bib-number">[34]</div>
  <div class="bib-citation">C. Percival, "How Tarsnap Deduplication Works," <em>Tarsnap Documentation</em>. Describes Tarsnap's context-dependent variable-size chunking for client-side deduplication.</div>
  <div class="bib-links">
    <a href="https://www.tarsnap.com/deduplication-explanation" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Docs</a>
    <a href="https://www.tarsnap.com/" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> Site</a>
  </div>
</div>

<div class="bib-entry" id="ref-35">
  <div class="bib-number">[35]</div>
  <div class="bib-citation">L. Poettering, "casync: Content-Addressable Data Synchronization Tool," 2017. Uses CDC to split filesystem images into variable-size chunks for efficient OS and container image distribution.</div>
  <div class="bib-links">
    <a href="https://github.com/systemd/casync" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> GitHub</a>
    <a href="https://lwn.net/Articles/726625/" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> LWN</a>
  </div>
</div>

</div>

<p><strong>Tools &amp; Implementations</strong></p>
<ul>
  <li><a href="https://github.com/nlfiedler/fastcdc-rs">fastcdc-rs on GitHub</a></li>
</ul>

<hr />

<p><em>The interactive animations in this post are available for experimentation. Try modifying the input text, adjusting chunk size parameters, and watching how CDC adapts to your changes.</em></p>

<div class="cdc-series-nav">
&larr; <a href="/writings/content-defined-chunking-part-2">Part 2: A Deep Dive into FastCDC</a> · Continue reading &rarr; <a href="/writings/content-defined-chunking-part-4">Part 4: CDC in the Cloud</a>
</div>

<script type="module" src="/assets/js/cdc-animations.js"></script>

]]>
      </content:encoded>
    </item>
    
    <item>
      <title>A Deep Dive into FastCDC</title>
      <link>https://rickwinfrey.com/writings/content-defined-chunking-part-2</link>
      <guid isPermaLink="true">https://rickwinfrey.com/writings/content-defined-chunking-part-2</guid>
      <pubDate>Mon, 09 Feb 2026 12:00:00 +0000</pubDate>
      
      <description>An exploration of FastCDC&apos;s Gear hash, normalized chunking with dual masks, and the 2020 two-byte-per-iteration optimization, with code in pseudocode, Rust, and TypeScript.</description>
      
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <![CDATA[<style>
/* ==========================================================================
   CDC Animation Styles
   ========================================================================== */

/* View mode tabs (Text / Blocks / Hex) */
.cdc-view-tabs {
  display: flex;
  gap: 0.25rem;
  background: rgba(61, 58, 54, 0.05);
  padding: 0.25rem;
  border-radius: 6px;
}

.cdc-view-tab {
  padding: 0.4rem 0.75rem;
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 0.8rem;
  color: #8b7355;
  background: transparent;
  border: none;
  border-radius: 4px;
  cursor: pointer;
  transition: all 0.15s ease;
}

.cdc-view-tab:hover {
  color: #3d3a36;
}

.cdc-view-tab.active {
  background: #fff;
  color: #c45a3b;
  box-shadow: 0 1px 3px rgba(0,0,0,0.1);
}

/* Content display area */
.cdc-content {
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 1rem;
  line-height: 1.8;
  color: #3d3a36;
}

/* Text view with chunk highlighting */
.cdc-text-view {
  white-space: pre-wrap;
  word-break: break-word;
}

.cdc-text-view .chunk {
  display: inline;
  padding: 0.1rem 0;
  border-radius: 2px;
  transition: background-color 0.2s ease;
}

/* Chunk colors - warm palette matching site */
.cdc-text-view .chunk-0 { background-color: rgba(196, 90, 59, 0.15); }
.cdc-text-view .chunk-1 { background-color: rgba(212, 165, 116, 0.25); }
.cdc-text-view .chunk-2 { background-color: rgba(139, 115, 85, 0.15); }
.cdc-text-view .chunk-3 { background-color: rgba(196, 90, 59, 0.25); }
.cdc-text-view .chunk-4 { background-color: rgba(212, 165, 116, 0.15); }
.cdc-text-view .chunk-5 { background-color: rgba(139, 115, 85, 0.25); }

/* Block view */
.cdc-blocks-view {
  display: flex;
  align-items: stretch;
  gap: 2px;
  margin-top: 1rem;
  padding: 0.5rem 0;
  width: 100%;
}

.cdc-block {
  height: 24px;
  border-radius: 3px;
  transition: all 0.2s ease;
  position: relative;
}

.cdc-block.chunk-0 { background-color: #c45a3b; }
.cdc-block.chunk-1 { background-color: #d4a574; }
.cdc-block.chunk-2 { background-color: #8b7355; }
.cdc-block.chunk-3 { background-color: #c45a3b; opacity: 0.7; }
.cdc-block.chunk-4 { background-color: #d4a574; opacity: 0.7; }
.cdc-block.chunk-5 { background-color: #8b7355; opacity: 0.7; }

.cdc-block:hover {
  transform: scaleY(1.2);
  z-index: 1;
}

/* Hex view */
.cdc-hex-view {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  line-height: 1.6;
  display: flex;
  flex-wrap: wrap;
  gap: 0.5rem 1rem;
}

.cdc-hex-byte {
  padding: 0.15rem 0.3rem;
  border-radius: 2px;
}

/* File icon visualization for fixed vs CDC comparison */
.cdc-file-icon {
  position: relative;
  background: #fff;
  border: 1px solid #d0d0d0;
  border-radius: 3px;
  padding: 1.5rem;
  padding-top: 2.25rem;
  margin: 0.75rem 0;
  box-shadow: 0 2px 8px rgba(0, 0, 0, 0.08);
}

/* Folded corner effect */
.cdc-file-corner {
  position: absolute;
  top: 0;
  right: 0;
  width: 0;
  height: 0;
  border-style: solid;
  border-width: 0 24px 24px 0;
  border-color: transparent #faf9f7 transparent transparent;
  filter: drop-shadow(-1px 1px 1px rgba(0, 0, 0, 0.1));
}

.cdc-file-corner::before {
  content: '';
  position: absolute;
  top: 0;
  right: -24px;
  width: 0;
  height: 0;
  border-style: solid;
  border-width: 0 0 24px 24px;
  border-color: transparent transparent #e8e8e8 transparent;
}

.cdc-file-label {
  position: absolute;
  top: 0.6rem;
  left: 1rem;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  font-weight: 600;
  color: #8b7355;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

.cdc-file-content {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  line-height: 2.2;
  color: #3d3a36;
  white-space: pre-wrap;
  word-break: break-word;
}

.cdc-chunk-explanation {
  font-size: 0.8rem;
  color: #8b7355;
  margin: 0.25rem 0 0.5rem 0;
  font-style: italic;
}

/* Chunk spans with box styling - matches CHUNK_SOLID_COLORS from cdc-animations.js */
.cdc-chunk {
  padding: 0.2rem 0.35rem;
  border-radius: 3px;
  border: 2px solid;
  display: inline;
  box-decoration-break: clone;
  -webkit-box-decoration-break: clone;
}

.cdc-chunk.chunk-a {
  background: rgba(196, 90, 59, 0.15);
  border-color: #c45a3b;
}

.cdc-chunk.chunk-b {
  background: rgba(90, 138, 90, 0.15);
  border-color: #5a8a5a;
}

.cdc-chunk.chunk-c {
  background: rgba(70, 110, 160, 0.15);
  border-color: #466ea0;
}

.cdc-chunk.chunk-d {
  background: rgba(160, 100, 50, 0.15);
  border-color: #a06432;
}

.cdc-chunk.chunk-e {
  background: rgba(130, 80, 150, 0.15);
  border-color: #825096;
}

/* New chunk - terracotta accent to match interactive demos */
.cdc-chunk.chunk-new {
  background: rgba(196, 90, 59, 0.2);
  border-color: #c45a3b;
  border-style: solid;
}

/* Unchanged chunk - muted gray, matches shared/dedup style in animations */
.cdc-chunk.unchanged {
  background: rgba(61, 58, 54, 0.06);
  border-color: rgba(61, 58, 54, 0.2);
  color: #8b8178;
}

/* Changed chunk - dashed border to signal the chunk content shifted */
.cdc-chunk.changed {
  border-style: dashed;
}

/* Chunk Comparison Demo (JS-powered before/after) */
.cdc-chunk-comparison-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  font-weight: 600;
  color: #8b7355;
  text-transform: uppercase;
  letter-spacing: 0.05em;
  margin-bottom: 0.4rem;
}

.cdc-chunk-comparison-file {
  margin-bottom: 0.75rem;
}

.cdc-chunk-comparison-text {
  white-space: pre-wrap;
  word-break: break-word;
  padding: 0.75rem;
  background: rgba(61, 58, 54, 0.02);
  border-radius: 6px;
  border: 1px solid rgba(61, 58, 54, 0.06);
  margin-bottom: 0.5rem;
  font-size: 0.85rem;
  line-height: 1.6;
}

.cdc-cmp-chunk {
  padding: 0.15rem 0.25rem;
  border-radius: 3px;
  border: 2px solid;
  display: inline-block;
  cursor: default;
  transition: filter 0.1s ease;
}

.cdc-cmp-chunk.unchanged {
  background: rgba(61, 58, 54, 0.06);
  border-color: rgba(61, 58, 54, 0.2);
  color: #8b8178;
}

.cdc-cmp-chunk.new {
  border-style: solid;
}

.cdc-cmp-chunk.chunk-hover {
  filter: brightness(0.82);
  outline: 3px solid rgba(61, 58, 54, 0.5);
  outline-offset: 0px;
  box-shadow: 0 0 8px rgba(0, 0, 0, 0.15);
}

.cdc-cmp-chunk.unchanged.chunk-hover {
  filter: brightness(0.85);
  outline: 3px solid rgba(61, 58, 54, 0.4);
  background: rgba(61, 58, 54, 0.15);
}

/* Chunk wrapper: label above, text below */
.cdc-cmp-chunk-wrapper {
  display: inline-flex;
  flex-direction: column;
  align-items: center;
  vertical-align: top;
  margin: 0.15rem 0.2rem;
}

.cdc-chunk-summary {
  display: grid;
  grid-template-columns: repeat(4, 1fr);
  gap: 0.75rem;
  padding: 0.75rem;
  margin: 0.5rem 0;
  background: rgba(61, 58, 54, 0.03);
  border-radius: 6px;
  border: 1px solid rgba(61, 58, 54, 0.06);
}

.cdc-chunk-summary-stat {
  text-align: center;
}

.cdc-chunk-summary-value {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 1.1rem;
  font-weight: 600;
  line-height: 1.2;
}

.cdc-chunk-summary-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.65rem;
  color: #8b7355;
  margin-top: 0.2rem;
  text-transform: uppercase;
  letter-spacing: 0.04em;
}

.cdc-cmp-chunk-label {
  display: block;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  font-weight: 600;
  text-align: center;
  letter-spacing: 0.02em;
  margin-bottom: 0.15rem;
}

/* Edit indicator arrow */
.cdc-edit-indicator {
  text-align: center;
  font-size: 0.8rem;
  color: #8b7355;
  padding: 0.5rem 0;
}

/* Deduplication result */
.cdc-dedup-result {
  text-align: center;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  font-weight: 600;
  padding: 0.75rem;
  border-radius: 6px;
  margin-top: 0.75rem;
}

.cdc-dedup-result.bad {
  background: rgba(196, 90, 59, 0.1);
  color: #a84832;
}

.cdc-dedup-result.good {
  background: rgba(90, 160, 90, 0.1);
  color: #3d8b3d;
}

/* Rolling window indicator */
.cdc-window {
  position: absolute;
  height: 100%;
  background: rgba(196, 90, 59, 0.3);
  border: 2px solid #c45a3b;
  border-radius: 4px;
  pointer-events: none;
  transition: left 0.1s ease;
}

/* Hash display */
.cdc-hash-display {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.8rem;
  color: #8b7355;
  min-height: 1.4em;
}

.cdc-hash-display strong {
  color: #c45a3b;
}

/* Controls panel */
.cdc-controls {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(140px, 1fr));
  gap: 1.25rem;
  padding: 1.25rem;
  background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-top: none;
  border-radius: 0 0 8px 8px;
}

.cdc-control-group {
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
}

.cdc-control-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  color: #3d3a36;
}

.cdc-controls input[type="range"] {
  width: 100%;
  height: 6px;
  -webkit-appearance: none;
  appearance: none;
  background: linear-gradient(to right, #d4a574, #c45a3b);
  border-radius: 3px;
  outline: none;
}

.cdc-controls input[type="range"]::-webkit-slider-thumb {
  -webkit-appearance: none;
  appearance: none;
  width: 16px;
  height: 16px;
  background: #c45a3b;
  border-radius: 50%;
  cursor: pointer;
  box-shadow: 0 2px 4px rgba(0,0,0,0.2);
  transition: transform 0.15s ease;
}

.cdc-controls input[type="range"]::-webkit-slider-thumb:hover {
  transform: scale(1.1);
}

.cdc-controls input[type="range"]::-moz-range-thumb {
  width: 16px;
  height: 16px;
  background: #c45a3b;
  border-radius: 50%;
  cursor: pointer;
  border: none;
  box-shadow: 0 2px 4px rgba(0,0,0,0.2);
}

/* Playback controls */
.cdc-playback {
  display: flex;
  align-items: center;
  gap: 0.75rem;
  margin-top: 1rem;
  padding: 1rem 1.25rem;
  background: rgba(61, 58, 54, 0.02);
  border-top: 1px solid rgba(61, 58, 54, 0.08);
}

.cdc-playback-btn {
  width: 36px;
  height: 36px;
  border-radius: 50%;
  border: none;
  background: #c45a3b;
  color: #fff;
  cursor: pointer;
  display: flex;
  align-items: center;
  justify-content: center;
  transition: all 0.15s ease;
}

.cdc-playback-btn:hover {
  background: #a84832;
  transform: scale(1.05);
}

.cdc-playback-btn.secondary {
  background: rgba(61, 58, 54, 0.1);
  color: #3d3a36;
}

.cdc-playback-btn.secondary:hover {
  background: rgba(61, 58, 54, 0.2);
}

.cdc-speed-control {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  margin-left: auto;
}

.cdc-speed-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.8rem;
  color: #8b7355;
}

/* Progress indicator */
.cdc-progress {
  flex: 1;
  height: 4px;
  background: rgba(61, 58, 54, 0.1);
  border-radius: 2px;
  overflow: hidden;
  margin: 0 0.5rem;
}

.cdc-progress-bar {
  height: 100%;
  background: linear-gradient(to right, #d4a574, #c45a3b);
  border-radius: 2px;
  transition: width 0.1s ease;
}

/* Side-by-side comparison */
.cdc-comparison {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1.5rem;
  margin: 2rem 0;
}

@media (max-width: 50em) {
  .cdc-comparison {
    grid-template-columns: 1fr;
  }
}

.cdc-comparison-panel {
  padding: 1.25rem;
  background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-radius: 8px;
}

.cdc-comparison-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 1rem;
  font-weight: 600;
  color: #3d3a36;
  margin-bottom: 1rem;
  padding-bottom: 0.75rem;
  border-bottom: 1px solid rgba(61, 58, 54, 0.1);
}

/* Chunk boundary marker */
.cdc-boundary-marker {
  display: inline-block;
  width: 2px;
  height: 1.2em;
  background: #c45a3b;
  margin: 0 1px;
  vertical-align: middle;
  border-radius: 1px;
}

/* Stats display */
.cdc-stats {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(120px, 1fr));
  gap: 1rem;
  padding: 1rem;
  background: rgba(61, 58, 54, 0.02);
  border-radius: 6px;
  margin-top: 1rem;
}

.cdc-stat {
  text-align: center;
}

.cdc-stat-value {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 1.5rem;
  font-weight: 600;
  color: #c45a3b;
}

.cdc-stat-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.75rem;
  color: #8b7355;
  margin-top: 0.25rem;
}

/* Deduplication visualization */
.cdc-dedup-viz {
  margin: 2rem 0;
}

.cdc-dedup-files {
  display: grid;
  grid-template-columns: 1fr auto 1fr;
  gap: 1rem;
  align-items: start;
}

.cdc-dedup-arrow {
  display: flex;
  align-items: center;
  justify-content: center;
  padding: 2rem 0;
  color: #8b7355;
  font-size: 1.5rem;
}

.cdc-dedup-storage {
  margin-top: 1.5rem;
  padding: 1.25rem;
  background: linear-gradient(135deg, rgba(196, 90, 59, 0.05) 0%, rgba(212, 165, 116, 0.08) 100%);
  border-radius: 8px;
}

.cdc-dedup-storage-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.9rem;
  font-weight: 600;
  color: #3d3a36;
  margin-bottom: 0.75rem;
}

.cdc-dedup-chunks {
  display: flex;
  flex-wrap: wrap;
  gap: 0.5rem;
}

.cdc-dedup-chunk {
  padding: 0.4rem 0.75rem;
  border-radius: 4px;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  color: #fff;
}

.cdc-dedup-chunk.shared {
  box-shadow: 0 0 0 2px #fff, 0 0 0 4px currentColor;
}

/* Versioned Dedup - Editor */
.cdc-dedup-editor { display: flex; flex-direction: column; gap: 0.75rem; margin-bottom: 1.5rem; }

.cdc-dedup-textarea {
  width: 100%; min-height: 80px; padding: 0.75rem;
  font-family: 'Source Serif 4', Georgia, serif; font-size: 0.9rem; line-height: 1.6;
  color: #3d3a36; background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.2); border-radius: 6px;
  resize: vertical; box-sizing: border-box;
}
.cdc-dedup-textarea:focus { outline: none; border-color: #c45a3b; box-shadow: 0 0 0 2px rgba(196, 90, 59, 0.15); }

.cdc-dedup-save-btn {
  align-self: flex-start; padding: 0.5rem 1.25rem;
  font-family: 'Libre Baskerville', Georgia, serif; font-size: 0.85rem;
  color: #fff; background: #c45a3b; border: none; border-radius: 6px;
  cursor: pointer; transition: background 0.15s ease, transform 0.1s ease;
}
.cdc-dedup-save-btn:hover { background: #a84832; transform: translateY(-1px); }
.cdc-dedup-save-btn:active { transform: translateY(0); }

/* Versioned Dedup - Timeline */
.cdc-dedup-timeline { position: relative; margin-bottom: 1.5rem; }

.cdc-version-entry { display: flex; gap: 1rem; padding-bottom: 1.5rem; position: relative; }

.cdc-version-entry:not(:last-child)::before {
  content: ''; position: absolute; top: 15px; left: 5px;
  width: 2px; bottom: 0; background: rgba(61, 58, 54, 0.15);
}

.cdc-version-dot {
  position: relative; flex-shrink: 0;
  width: 12px; height: 12px; margin-top: 3px;
  background: #c45a3b; border-radius: 50%;
  border: 2px solid #fff; box-shadow: 0 0 0 1px rgba(61, 58, 54, 0.2);
}

.cdc-version-content { flex: 1; min-width: 0; }

.cdc-version-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.9rem; font-weight: 600; color: #3d3a36; margin-bottom: 0.5rem;
}

.cdc-version-cols { display: grid; grid-template-columns: 1fr 180px; gap: 1rem; align-items: start; }

.cdc-version-text {
  white-space: pre-wrap; word-break: break-word;
  padding: 0.5rem; background: rgba(61, 58, 54, 0.02);
  border-radius: 6px; border: 1px solid rgba(61, 58, 54, 0.06);
}

.cdc-version-blocks { display: flex; flex-direction: column; gap: 0.5rem; }
.cdc-version-blocks .cdc-blocks-view { margin-top: 0; min-height: 24px; }

.cdc-version-stats {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.7rem; color: #8b7355; line-height: 1.4;
}

.cdc-dedup-timeline-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem; font-weight: 600; color: #8b7355;
  text-transform: uppercase; letter-spacing: 0.06em;
  margin-bottom: 0.75rem;
}

[data-chunk-hash].hash-hover {
  filter: brightness(0.85);
  outline: 2px solid rgba(61, 58, 54, 0.4);
  outline-offset: -1px;
}

@media (max-width: 42em) {
  .cdc-version-cols { grid-template-columns: 1fr; }
}

/* Table of Contents */
.cdc-toc {
  margin: 2rem 0;
  padding: 1.25rem 1.5rem;
  background: rgba(61, 58, 54, 0.03);
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-radius: 8px;
}

.cdc-toc strong {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.95rem;
  color: #3d3a36;
}

.cdc-toc ol {
  margin: 0.75rem 0 0 0;
  padding-left: 1.25rem;
}

.cdc-toc li {
  margin-bottom: 0.4rem;
  font-size: 0.9rem;
  line-height: 1.5;
  color: #5a564f;
}

.cdc-toc a {
  color: #c45a3b;
  text-decoration: none;
  font-weight: 600;
}

.cdc-toc a:hover {
  text-decoration: underline;
}

.cdc-toc ul {
  margin: 0.25rem 0 0.25rem 0;
  padding-left: 1.25rem;
  list-style: none;
}

.cdc-toc ul li {
  margin-bottom: 0.15rem;
  font-size: 0.82rem;
  color: #8b7355;
}

.cdc-toc ul li a {
  font-weight: 400;
  color: #8b7355;
}

.cdc-toc ul li a:hover {
  color: #c45a3b;
}

/* Taxonomy tree diagram */
.cdc-taxonomy {
  padding: 1.25rem 1.5rem;
  background: rgba(61, 58, 54, 0.03);
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-radius: 8px;
}

.cdc-taxonomy strong {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.95rem;
  color: #3d3a36;
}

.cdc-taxonomy-tree {
  margin-top: 0.75rem;
  display: flex;
  flex-direction: column;
  align-items: center;
  gap: 0;
}

/* Root node */
.cdc-tax-root {
  padding: 0.4rem 1rem;
  background: #3d3a36;
  color: #fff;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.8rem;
  font-weight: 600;
  border-radius: 6px;
  text-align: center;
}

/* Vertical connector from root */
.cdc-tax-vline {
  width: 2px;
  height: 16px;
  background: rgba(61, 58, 54, 0.25);
}

/* Horizontal bar connecting the three families */
.cdc-tax-hbar {
  width: 80%;
  height: 2px;
  background: rgba(61, 58, 54, 0.25);
  position: relative;
}

/* Three-column family layout */
.cdc-tax-families {
  display: grid;
  grid-template-columns: 1fr 1fr 1fr;
  gap: 0.5rem;
  width: 100%;
  margin-top: 0;
}

.cdc-tax-family {
  display: flex;
  flex-direction: column;
  align-items: center;
  gap: 0;
}

/* Vertical connector from hbar to family label */
.cdc-tax-family .cdc-tax-vline {
  height: 12px;
}

.cdc-tax-family-label {
  padding: 0.3rem 0.5rem;
  border-radius: 5px;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  font-weight: 600;
  text-align: center;
  white-space: nowrap;
}

.cdc-tax-family-label.bsw {
  background: rgba(196, 90, 59, 0.12);
  color: #c45a3b;
  border: 1px solid rgba(196, 90, 59, 0.25);
}

.cdc-tax-family-label.extrema {
  background: rgba(42, 125, 79, 0.1);
  color: #2a7d4f;
  border: 1px solid rgba(42, 125, 79, 0.2);
}

.cdc-tax-family-label.statistical {
  background: rgba(139, 115, 85, 0.12);
  color: #8b7355;
  border: 1px solid rgba(139, 115, 85, 0.25);
}

.cdc-tax-algorithms {
  margin-top: 0.35rem;
  display: flex;
  flex-direction: column;
  align-items: center;
  gap: 0.2rem;
}

.cdc-tax-algo {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.65rem;
  color: #5a564f;
  line-height: 1.3;
}

.cdc-tax-algo .cdc-tax-year {
  color: #a89b8c;
}

.cdc-learn-more {
  display: inline-flex;
  align-items: center;
  gap: 0.4rem;
  margin-top: 0.75rem;
  padding: 0.4rem 0.75rem;
  background: rgba(212, 165, 116, 0.15);
  border-radius: 4px;
  font-size: 0.8rem;
  font-style: normal;
  color: #8b7355;
}

.cdc-learn-more::before {
  content: "💡";
}

/* Combined text + hex view */
.cdc-combined-view {
  display: flex;
  flex-wrap: wrap;
  gap: 1px;
}

.cdc-byte-col {
  display: flex;
  flex-direction: column;
  align-items: center;
  border-radius: 2px;
  padding: 0.15rem 0.1rem;
}

.cdc-byte-char {
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 0.95rem;
  line-height: 1.4;
  color: #3d3a36;
}

.cdc-byte-hex {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  line-height: 1;
  color: #8b7355;
  margin-top: 1px;
}

/* Block annotation bar (below text/hex views) */
.cdc-block-wrapper {
  display: flex;
  flex-direction: column;
  align-items: center;
}

.cdc-block {
  width: 100%;
}

.cdc-block-annotation {
  width: 100%;
  position: relative;
  margin-top: 0.3rem;
}

.cdc-block-line {
  width: 100%;
  height: 0;
  border-top: 1.5px solid #8b7355;
  opacity: 0.5;
}

.cdc-block-tick {
  position: absolute;
  left: 50%;
  top: 0;
  transform: translateX(-50%);
  width: 1.5px;
  height: 8px;
  background: #8b7355;
  opacity: 0.5;
}

.cdc-block-label {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #8b7355;
  white-space: nowrap;
  user-select: none;
  line-height: 1;
  text-align: center;
  margin-top: 10px;
  overflow: hidden;
  text-overflow: ellipsis;
}

/* Fix chunk bar height in Gear Hash demo to prevent layout shift */
#gear-hash-demo .cdc-blocks-view {
  min-height: 60px;
}

/* Default text for Gear Hash demo */
#gear-hash-demo .cdc-byte-char {
  color: #000;
  font-weight: 600;
}
#gear-hash-demo .cdc-byte-hex {
  color: #6b5a42;
}

/* Chunk hover highlights */
.cdc-combined-view .cdc-byte-col.chunk-hover {
  filter: brightness(0.85);
  outline: 1px solid rgba(61, 58, 54, 0.25);
  outline-offset: -1px;
}

.cdc-block-wrapper.chunk-hover .cdc-block {
  filter: brightness(1.15);
  box-shadow: 0 0 6px rgba(0, 0, 0, 0.2);
}

.cdc-block-wrapper.chunk-hover .cdc-block-label {
  color: #3d3a36;
  font-weight: 600;
}

.cdc-text-view .chunk.chunk-hover {
  filter: brightness(0.9);
  outline: 1px solid rgba(61, 58, 54, 0.3);
  outline-offset: -1px;
}

/* Gear Lookup Table grid */
.gear-table-grid {
  display: grid;
  grid-template-columns: 1.2rem repeat(16, 1fr);
  grid-template-rows: 0.7rem repeat(16, 13px);
  gap: 1px;
  margin-top: 0.5rem;
  overflow: hidden;
}

.gear-table-label {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.5rem;
  color: #8b7355;
  display: flex;
  align-items: center;
  justify-content: center;
  user-select: none;
  padding: 0 2px;
}

.gear-table-label.col-header {
  padding-bottom: 2px;
}

.gear-table-label.row-header {
  padding-right: 3px;
}

.gear-table-legend {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.55rem;
  color: #8b7355;
  margin-top: 0.4rem;
  text-align: center;
}

.gear-table-cell {
  min-height: 10px;
  border-radius: 1px;
  cursor: pointer;
  transition: transform 0.1s ease, box-shadow 0.15s ease;
  position: relative;
}

.gear-table-cell:hover {
  transform: scale(1.4);
  z-index: 2;
  box-shadow: 0 0 4px rgba(0,0,0,0.2);
}

.gear-table-cell.active {
  outline: 2px solid #c45a3b;
  outline-offset: 0px;
  box-shadow: 0 0 6px rgba(196, 90, 59, 0.5);
  z-index: 3;
  transform: scale(1.4);
}

.gear-table-readout {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.8rem;
  color: #8b7355;
  min-height: 2.8em;
}

.gear-table-readout strong {
  color: #c45a3b;
}

/* Rolling hash window strip */
.gear-hash-window {
  display: flex;
  gap: 1px;
  overflow-x: auto;
  padding: 0.25rem 0;
  margin-bottom: 0.5rem;
  min-height: 3.2rem;
}

.gear-hw-cell {
  display: flex;
  flex-direction: column;
  align-items: center;
  min-width: 2rem;
  padding: 0.2rem 0.15rem;
  border-radius: 3px;
  background: rgba(61, 58, 54, 0.04);
  transition: background-color 0.15s ease;
}

.gear-hw-cell.current {
  outline: 2px solid #c45a3b;
  outline-offset: -1px;
}

.gear-hw-cell.boundary {
  border-right: 2px solid #2a7d4f;
}

.gear-hw-char {
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 0.8rem;
  color: #3d3a36;
  line-height: 1.2;
}

.gear-hw-hash {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.5rem;
  color: #8b7355;
  line-height: 1;
  margin-top: 2px;
}

.gear-hw-hash.boundary {
  color: #2a7d4f;
  font-weight: 700;
}

/* Bit-shift visualization */
.gear-shift-viz {
  margin-bottom: 0;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
}

.gear-shift-row {
  display: flex;
  align-items: center;
  gap: 0.3rem;
  margin-bottom: 3px;
}

.gear-shift-label {
  width: 4rem;
  text-align: right;
  color: #8b7355;
  font-size: 0.6rem;
  flex-shrink: 0;
}

.gear-shift-hex {
  width: 5.5rem;
  text-align: right;
  color: #3d3a36;
  font-size: 0.6rem;
  flex-shrink: 0;
  padding-right: 0.3rem;
}

.gear-shift-bits {
  display: flex;
  gap: 0;
  position: relative;
}

.gear-bit {
  width: 7px;
  height: 14px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.45rem;
  line-height: 1;
  border-radius: 1px;
}

.gear-bit.b0,
.gear-bit.b1 {
  background: rgba(61, 58, 54, 0.06);
  color: #3d3a36;
}

.gear-bit.dropped {
  background: rgba(196, 90, 59, 0.25);
  color: #c45a3b;
  text-decoration: line-through;
}

.gear-bit.entering {
  background: rgba(90, 138, 90, 0.25);
  color: #5a8a5a;
  font-weight: 700;
}

@keyframes gear-slide-left {
  0% { transform: translateX(7px); opacity: 0.5; }
  100% { transform: translateX(0); opacity: 1; }
}

.gear-shift-bits.animated .gear-bit {
  animation: gear-slide-left 0.25s ease-out;
}

.gear-shift-box {
  border: 1.5px solid rgba(196, 90, 59, 0.3);
  border-radius: 6px;
  padding: 0.4rem 0.5rem;
  background: rgba(196, 90, 59, 0.02);
}

.gear-shift-connector {
  text-align: center;
  color: #8b7355;
  font-size: 0.7rem;
  line-height: 1;
  padding: 0.15rem 0;
}

.gear-shift-add {
  border: 1.5px solid rgba(61, 58, 54, 0.12);
  border-radius: 6px;
  padding: 0.4rem 0.5rem;
  background: rgba(61, 58, 54, 0.02);
}

.gear-shift-separator {
  width: calc(32 * 7px);
  border-top: 1px solid rgba(61, 58, 54, 0.15);
  margin: 2px 0;
}

/* Two-column layout: Operation panel + Gear table */
.gear-two-col {
  display: grid;
  grid-template-columns: 1fr 1fr;
  grid-template-rows: auto auto auto auto;
  gap: 0 1.5rem;
  margin-top: 1rem;
}

.gear-col-left {
  display: contents;
}

.gear-col-right {
  display: contents;
}

.gear-col-left > .cdc-viz-header {
  grid-column: 1;
  grid-row: 1;
}

.gear-col-right > .cdc-viz-header {
  grid-column: 2;
  grid-row: 1;
}

.gear-col-left > .gear-table-readout {
  grid-column: 1;
  grid-row: 2;
}

.gear-col-right > .cdc-hash-display {
  grid-column: 2;
  grid-row: 2;
}

.gear-col-left > .gear-table-grid {
  grid-column: 1;
  grid-row: 3 / 5;
}

.gear-col-left > .gear-table-legend {
  grid-column: 1;
  grid-row: 5;
}

.gear-col-right > .gear-hash-window {
  grid-column: 2;
  grid-row: 3;
}

.gear-col-right > .gear-shift-viz {
  grid-column: 2;
  grid-row: 4;
  align-self: start;
}

/* Chunk boundary marker (vertical separator) */
.chunk-boundary-marker {
  display: inline-block;
  width: 2px;
  height: 1.2em;
  background: #c45a3b;
  margin: 0 2px;
  vertical-align: middle;
  border-radius: 1px;
  opacity: 0.6;
}

.chunk-label {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #8b7355;
  background: rgba(61, 58, 54, 0.06);
  padding: 0.1rem 0.3rem;
  border-radius: 2px;
  margin-right: 2px;
  vertical-align: top;
  line-height: 1;
  user-select: none;
}

/* KDE density curve (ComparisonDemo) */

/* Override container when used for KDE */
.kde-distribution-chart {
  display: block;
  height: 170px;
  padding: 0;
  margin-top: 0.75rem;
}

/* Plot area wrapper */
.kde-chart-wrapper {
  position: absolute;
  top: 0;
  left: 52px;
  right: 5px;
  bottom: 30px;
  overflow: visible;
}

/* SVG fills the wrapper */
.kde-chart-svg {
  display: block;
  width: 100%;
  height: 100%;
}

/* Tick labels (shared) */
.kde-tick {
  position: absolute;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.55rem;
  color: #8b7355;
  pointer-events: none;
  white-space: nowrap;
}

/* X-axis ticks: below the chart */
.kde-tick-x {
  bottom: 0;
  transform: translate(-50%, calc(100% + 2px));
}

/* Y-axis ticks: left of the chart */
.kde-tick-y {
  left: 0;
  transform: translate(calc(-100% - 4px), -50%);
  text-align: right;
}

/* Axis title labels */
.kde-axis-title {
  position: absolute;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.55rem;
  color: #8b7355;
  pointer-events: none;
  white-space: nowrap;
}

/* X-axis title: centered below ticks */
.kde-axis-title-x {
  bottom: 0;
  left: 50%;
  transform: translate(-50%, calc(100% + 16px));
}

/* Y-axis title: rotated, centered left of ticks */
.kde-axis-title-y {
  top: 50%;
  left: 0;
  transform: translate(calc(-100% - 30px), -50%) rotate(-90deg);
  transform-origin: center center;
}

/* Caption below charts explaining density */
.kde-caption {
  font-size: 0.78rem;
  color: #8b7355;
  text-align: center;
  margin-top: 0.25rem;
  margin-bottom: 0.5rem;
  line-height: 1.45;
}

/* Reference label inside wrapper */
.kde-ref-label {
  position: absolute;
  top: 0;
  margin-left: 5px;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #c45a3b;
  white-space: nowrap;
  pointer-events: none;
}

/* Parametric Chunking Explorer - distribution chart */
.parametric-distribution-chart {
  position: relative;
  display: flex;
  align-items: flex-end;
  gap: 2px;
  height: 120px;
  padding: 0.5rem 0;
  margin-bottom: 1rem;
}

.parametric-dist-bar {
  flex: 1;
  min-width: 3px;
  border-radius: 2px 2px 0 0;
  transition: height 0.2s ease;
  cursor: pointer;
  position: relative;
}

.parametric-dist-bar:hover { opacity: 0.8; }

.parametric-dist-tooltip {
  display: none;
  position: absolute;
  bottom: 100%;
  left: 50%;
  transform: translateX(-50%);
  background: #3d3a36;
  color: #fff;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.7rem;
  padding: 0.25rem 0.5rem;
  border-radius: 4px;
  white-space: nowrap;
  pointer-events: none;
  margin-bottom: 4px;
  z-index: 10;
}

.parametric-dist-bar:hover .parametric-dist-tooltip { display: block; }

.parametric-dist-reference {
  position: absolute;
  left: 0;
  right: 0;
  border-top: 2px dashed rgba(196, 90, 59, 0.5);
  pointer-events: none;
  z-index: 1;
}

.parametric-dist-reference-label {
  position: absolute;
  right: 0;
  top: -1.1rem;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #c45a3b;
  white-space: nowrap;
}

.parametric-dist-bar.chunk-hover {
  filter: brightness(1.15);
  box-shadow: 0 0 6px rgba(0, 0, 0, 0.2);
}

.parametric-derived-params {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  color: #8b7355;
  white-space: nowrap;
}

/* Basic vs Normalized Comparison */
.comparison-columns {
  display: grid;
  grid-template-columns: 1fr;
  gap: 1.5rem;
}

.comparison-label {
  display: flex;
  align-items: baseline;
  justify-content: space-between;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.95rem;
  font-weight: 600;
  color: #3d3a36;
  padding-bottom: 0.5rem;
  border-bottom: 1px solid rgba(61, 58, 54, 0.1);
  margin-bottom: 0.75rem;
}

.comparison-label-text {
  white-space: nowrap;
}

.comparison-sublabel {
  font-weight: 400;
  color: #8b7355;
  font-size: 0.85rem;
}

.comparison-col {
  min-width: 0;
  overflow: visible;
}

.comparison-col .cdc-blocks-view {
  overflow: hidden;
  height: 76px;
  position: relative;
}

.comparison-col .cdc-block-wrapper {
  min-width: 0;
  overflow: hidden;
}

.comparison-col .cdc-block {
  margin-top: auto;
}

/* Distribution illustration (static SVG curves) */
.dist-illustration {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1.5rem;
}

@media (max-width: 42em) {
  .dist-illustration {
    grid-template-columns: 1fr;
  }
}

.dist-illustration-svg {
  display: block;
  width: 100%;
  height: auto;
  overflow: visible;
}

.dist-illustration-note {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.75rem;
  color: #8b7355;
  margin-top: 0.35rem;
  line-height: 1.4;
}

.cdc-blocks-target-line {
  position: absolute;
  left: 0;
  right: 0;
  height: 0;
  border-top: 1.5px dashed rgba(61, 58, 54, 0.7);
  pointer-events: none;
  z-index: 1;
}

.comparison-summary {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  font-weight: 400;
  color: #8b7355;
}

.comparison-summary span {
  color: #c45a3b;
  font-weight: 600;
}

.parametric-section-header {
  display: flex;
  align-items: baseline;
  justify-content: space-between;
  margin-top: 1rem;
  padding-top: 0.75rem;
  border-top: 1px solid rgba(61, 58, 54, 0.1);
}

.parametric-section-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.9rem;
  font-weight: 600;
  color: #3d3a36;
  white-space: nowrap;
}

#parametric-blocks-bar {
  height: 70px;
  position: relative;
  overflow: hidden;
}

#parametric-blocks-bar .cdc-block-wrapper {
  min-width: 0;
  overflow: hidden;
}

#parametric-blocks-bar .cdc-block {
  margin-top: auto;
}

.parametric-summary {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  color: #8b7355;
  margin-bottom: 0.5rem;
}

.parametric-summary span {
  color: #c45a3b;
  font-weight: 600;
}

.comparison-blocks-hint {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.65rem;
  color: #a09080;
  margin-top: 0.35rem;
  line-height: 1.35;
}

/* Mobile responsive */
@media (max-width: 42em) {
  .cdc-controls {
    grid-template-columns: 1fr;
  }

  .cdc-dedup-files {
    grid-template-columns: 1fr;
  }

  .cdc-dedup-arrow {
    transform: rotate(90deg);
    padding: 1rem 0;
  }

  .cdc-chunk-summary {
    grid-template-columns: repeat(2, 1fr);
  }

  .cdc-hex-view {
    font-size: 0.65rem;
  }

  .gear-two-col {
    grid-template-columns: 1fr;
  }

  .gear-col-left > .cdc-viz-header { grid-row: auto; }
  .gear-col-right > .cdc-viz-header { grid-row: auto; }
  .gear-col-left > .gear-table-readout { grid-row: auto; }
  .gear-col-right > .cdc-hash-display { grid-row: auto; }
  .gear-col-left > .gear-table-grid { grid-row: auto; }
  .gear-col-left > .gear-table-legend { grid-row: auto; }
  .gear-col-right > .gear-hash-window { grid-row: auto; }
  .gear-col-right > .gear-shift-viz { grid-row: auto; }

  .gear-col-left > *,
  .gear-col-right > * {
    grid-column: 1;
  }

  .gear-table-grid {
    gap: 0px;
  }

  .gear-table-cell {
    border-radius: 0;
    height: 12px;
  }

  .gear-table-label {
    font-size: 0.4rem;
  }

}

</style>

<!-- MathJax for rendering mathematical notation -->
<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$']],
    displayMath: [['$$', '$$']]
  }
};
</script>

<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" async=""></script>

<div class="cdc-series-nav">
Part 2 of 5 in a series on Content-Defined Chunking. Previous: <a href="/writings/content-defined-chunking-part-1">Part 1: From Problem to Taxonomy</a> · Next: <a href="/writings/content-defined-chunking-part-3">Part 3: Deduplication in Action</a>
</div>

<p><a href="/writings/content-defined-chunking-part-1">Part 1</a> introduced the deduplication problem, showed why fixed-size chunking fails, and surveyed three CDC algorithm families: Basic Sliding Window (BSW), Local Extrema, and Statistical. This post focuses on <strong>FastCDC</strong>, the most widely adopted BSW algorithm, exploring its Gear hash, normalized chunking strategy, and tunable parameters through interactive demos.</p>

<hr />

<h2 id="a-closer-look-at-bsw-via-fastcdc">A Closer Look at BSW via FastCDC</h2>

<p>The <strong>Basic Sliding Window</strong> family includes FastCDC, which is true to its name: it is <em>fast</em>. Where Rabin used polynomial arithmetic and Buzhash used cyclic shifts, FastCDC’s Gear hash strips the rolling hash down to its simplest possible form. That speed, combined with normalized chunking for tighter chunk-size distributions, has made FastCDC one of the most widely implemented CDC algorithms today, with mature libraries in Rust, Go, Python, Java, and C++. We’ll explore both the 2016<span class="cdc-cite"><a href="#ref-5">[5]</a></span> and 2020<span class="cdc-cite"><a href="#ref-6">[6]</a></span> versions in detail, using the <a href="https://github.com/nlfiedler/fastcdc-rs"><code>fastcdc-rs</code></a> Rust crate as our reference implementation.</p>

<h3 id="the-gear-hash">The Gear Hash</h3>

<p>At FastCDC’s core is the <strong>Gear hash</strong>, a rolling hash reduced to two operations. For each byte, you:</p>
<ol>
  <li><strong>Left-shift</strong> the current hash by one bit, dropping the most significant bit</li>
  <li><strong>Add</strong> the Gear table entry for the current byte (a pre-computed 64-bit random value) to the hash, wrapping on overflow</li>
</ol>

<p>That’s it. No XOR with outgoing bytes, no polynomial division. Just shift and add.</p>
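<p>One way to see why no outgoing-byte step is needed: every left shift pushes older contributions one bit higher, and after 64 shifts a byte's contribution has left the 64-bit word entirely. The Rust sketch below illustrates that property. The table generator <code>make_gear_table</code> (a splitmix64-style mixer), the seed, and the helper names <code>gear_update</code> and <code>tail_determines_hash</code> are assumptions made for illustration; they are not taken from <code>fastcdc-rs</code>, which ships its own fixed table.</p>

```rust
/// Illustrative table generator (an assumption, not the fastcdc-rs table):
/// fills 256 entries using a splitmix64-style mixer.
fn make_gear_table(seed: u64) -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state = seed;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// One Gear step: shift left by one bit, then add the entry for `byte`.
fn gear_update(hash: u64, byte: u8, gear: &[u64; 256]) -> u64 {
    (hash << 1).wrapping_add(gear[byte as usize])
}

/// Fold the update over a whole slice, starting from zero.
fn gear_hash(data: &[u8], gear: &[u64; 256]) -> u64 {
    data.iter().fold(0, |h, &b| gear_update(h, b, gear))
}

/// Hypothetical check: two 200-byte inputs that differ only before their
/// final 64 bytes should produce the same hash.
fn tail_determines_hash(gear: &[u64; 256]) -> bool {
    let a: Vec<u8> = (0..200u32).map(|i| (i * 7 % 256) as u8).collect();
    let mut b = a.clone();
    b[0] ^= 0xFF; // shifted left 199 times by the end: no bits remain
    b[100] ^= 0xFF; // shifted left 99 times by the end: no bits remain
    gear_hash(&a, gear) == gear_hash(&b, gear)
}
```

<p>Calling <code>tail_determines_hash(&amp;make_gear_table(1))</code> should return <code>true</code>: the two inputs share their last 64 bytes, and those are the only bytes that can still influence a 64-bit Gear hash.</p>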

<p>The visualization below uses 32-bit values for compactness. The real FastCDC implementation uses 64-bit Gear table entries and a 64-bit hash accumulator, as shown in the code samples that follow. The algorithm works the same way at either width; only the bit count and mask positions change, and with them the number of trailing bytes that can influence the hash (32 rather than 64). To internalize the algorithm, advance the animation one step at a time and watch the Gear table lookup, the left shift of the rolling hash, and the binary addition at each position. Playing at full speed shows the overall flow, but stepping frame by frame is the best way to build intuition.</p>

<div class="cdc-viz" id="gear-hash-demo">
  <div class="cdc-viz-header">
    <div class="cdc-viz-title">Gear Hash in Action</div>
  </div>
  <div class="cdc-content">
    <div id="gear-content-display" class="cdc-combined-view">
      The quick brown fox jumps over the lazy dog. She packed her seven boxes and left. A warm breeze drifted through the open window.
    </div>
  </div>

  <!-- Two-column layout: Gear table on left, Operation + Hash on right -->
  <div class="gear-two-col">
    <div class="gear-col-left">
      <div class="cdc-viz-header" style="border-bottom: none; margin-bottom: 0.5rem; padding-bottom: 0;">
        <div class="cdc-viz-title">Gear Lookup Table</div>
        <p class="cdc-viz-hint">Each colored block is one of 256 pre-computed random 32-bit values, keyed by byte. Hover a cell to see its mapping.</p>
      </div>
      <div class="gear-table-readout" id="gear-table-readout">GEAR[--] = --</div>
      <div class="gear-table-grid" id="gear-table-grid">
        <!-- axis labels + 256 cells populated by JS -->
      </div>
      <div class="gear-table-legend">Rows 0-1: control bytes &middot; Rows 2-7: printable ASCII &middot; Cell 7F: DEL &middot; Rows 8-F: extended bytes</div>
    </div>

    <div class="gear-col-right">
      <div class="cdc-viz-header" style="border-bottom: none; margin-bottom: 0; padding-bottom: 0;">
        <div class="cdc-viz-title">Rolling Hash Window</div>
        <p class="cdc-viz-hint">The hash rolls forward one byte at a time. When it matches a bit pattern, a chunk boundary is placed. Target chunk size: min 8, avg 16, max 32 bytes.</p>
      </div>
      <div class="cdc-hash-display" id="gear-hash-display">Current Hash: <strong>0x00000000</strong></div>
      <div class="gear-hash-window" id="gear-hash-window"></div>
      <div class="gear-shift-viz" id="gear-shift-viz"></div>
    </div>
  </div>

  <!-- Playback Controls -->
  <div class="cdc-playback">
    <button class="cdc-playback-btn" id="gear-play-btn" title="Play / Pause">
      <span class="fa-solid fa-play"></span>
    </button>
    <button class="cdc-playback-btn secondary" id="gear-step-btn" title="Step forward one byte">
      <span class="fa-solid fa-forward-step"></span>
    </button>
    <button class="cdc-playback-btn secondary" id="gear-reset-btn" title="Reset to beginning">
      <span class="fa-solid fa-rotate-left"></span>
    </button>
    <div class="cdc-progress">
      <div class="cdc-progress-bar" id="gear-progress" style="width: 0%"></div>
    </div>
    <div class="cdc-speed-control">
      <span class="cdc-speed-label">Speed</span>
      <input type="range" id="gear-speed" min="1" max="10" value="7" style="width: 80px;" title="Playback speed" />
    </div>
  </div>
</div>
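<p>The demo's hint says a boundary is placed when the hash matches a bit pattern. As a rough sketch of that rule at the demo's 32-bit width, the Rust below cuts a chunk when the masked hash is zero, subject to the 8-byte minimum and 32-byte maximum from the hint. The mask value, the table generator <code>make_gear_table_32</code>, and the helper names are assumptions made for illustration; the demo's actual bit pattern and FastCDC's normalized chunking are not reproduced here.</p>

```rust
/// Demo-sized limits from the visualization hint.
const MIN_SIZE: usize = 8;
const MAX_SIZE: usize = 32;
/// Assumed mask with four bits set, so a uniformly random hash matches
/// about one position in 16. The bit positions are arbitrary here.
const MASK: u32 = 0xF000_0000;

/// Illustrative 32-bit table (an assumption): a splitmix64-style mixer,
/// keeping the high 32 bits of each output.
fn make_gear_table_32(seed: u32) -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut state = seed as u64;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = ((z ^ (z >> 31)) >> 32) as u32;
    }
    table
}

/// Length of the first chunk in `data`: the first position at or past
/// MIN_SIZE where the masked hash is zero, capped at MAX_SIZE.
fn next_boundary(data: &[u8], gear: &[u32; 256]) -> usize {
    if data.len() <= MIN_SIZE {
        return data.len(); // too short to split further
    }
    let end = data.len().min(MAX_SIZE);
    let mut hash: u32 = 0;
    for (i, &byte) in data[..end].iter().enumerate() {
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        let len = i + 1;
        if len >= MIN_SIZE && (hash & MASK) == 0 {
            return len; // content-defined cut point
        }
    }
    end // forced cut at MAX_SIZE or end of input
}

/// Split `data` into consecutive chunks and return their lengths.
fn chunk_lengths(data: &[u8], gear: &[u32; 256]) -> Vec<usize> {
    let mut lengths = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        let len = next_boundary(&data[pos..], gear);
        lengths.push(len);
        pos += len;
    }
    lengths
}
```

<p>Every chunk except possibly the last comes out between 8 and 32 bytes. The minimum and maximum cut-offs shift the realized average away from the nominal 16, so treat the parameters as illustrative rather than as a reproduction of the demo.</p>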

<p>The Gear table maps each of the 256 possible byte values to a pre-computed 64-bit random number:</p>

<div class="code-tabs" id="gear-table-code">
  <div class="code-tab-buttons">
    <button class="code-tab-btn active" data-lang="pseudocode">Pseudocode</button>
    <button class="code-tab-btn" data-lang="rust">Rust</button>
    <button class="code-tab-btn" data-lang="typescript">TypeScript</button>
  </div>

  <div class="code-tab-content active" data-lang="pseudocode">

<figure class="highlight"><pre><code class="language-text" data-lang="text"><table class="rouge-table"><tbody><tr><td class="code"><pre>// GEAR table: 256 random 64-bit values, one per possible byte value
GEAR[256] = generate_random_table()

function gear_hash(data, start, end):
    hash = 0
    for i from start to end:
        // Shift left one bit, then add this byte's table entry.
        // All arithmetic wraps at 64 bits; overflowing bits are discarded.
        hash = (hash &lt;&lt; 1) + GEAR[data[i]]
    return hash
</pre></td></tr></tbody></table></code></pre></figure>

  </div>

  <div class="code-tab-content" data-lang="rust">

<figure class="highlight"><pre><code class="language-rust" data-lang="rust"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="c1">// From fastcdc-rs v2020</span>
<span class="k">const</span> <span class="n">GEAR</span><span class="p">:</span> <span class="p">[</span><span class="nb">u64</span><span class="p">;</span> <span class="mi">256</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span>
    <span class="mi">0x3b5d3c7d207e37dc</span><span class="p">,</span> <span class="mi">0x784d68ba91123086</span><span class="p">,</span>
    <span class="mi">0xcd52880f882e7298</span><span class="p">,</span> <span class="mi">0xecc4917415d5c696</span><span class="p">,</span>
    <span class="c1">// ... 252 more values</span>
<span class="p">];</span>

<span class="k">fn</span> <span class="nf">gear_hash</span><span class="p">(</span><span class="n">data</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">u8</span><span class="p">])</span> <span class="k">-&gt;</span> <span class="nb">u64</span> <span class="p">{</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">hash</span><span class="p">:</span> <span class="nb">u64</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="o">&amp;</span><span class="n">byte</span> <span class="k">in</span> <span class="n">data</span> <span class="p">{</span>
        <span class="n">hash</span> <span class="o">=</span> <span class="n">hash</span><span class="nf">.wrapping_shl</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
                   <span class="nf">.wrapping_add</span><span class="p">(</span><span class="n">GEAR</span><span class="p">[</span><span class="n">byte</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]);</span>
    <span class="p">}</span>
    <span class="n">hash</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

  </div>

  <div class="code-tab-content" data-lang="typescript">

<figure class="highlight"><pre><code class="language-typescript" data-lang="typescript"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="c1">// GEAR table (first few values shown)</span>
<span class="kd">const</span> <span class="nx">GEAR</span><span class="p">:</span> <span class="nx">bigint</span><span class="p">[]</span> <span class="o">=</span> <span class="p">[</span>
    <span class="mh">0x3b5d3c7d207e37dc</span><span class="nx">n</span><span class="p">,</span> <span class="mh">0x784d68ba91123086</span><span class="nx">n</span><span class="p">,</span>
    <span class="mh">0xcd52880f882e7298</span><span class="nx">n</span><span class="p">,</span> <span class="mh">0xecc4917415d5c696</span><span class="nx">n</span><span class="p">,</span>
    <span class="c1">// ... 252 more values</span>
<span class="p">];</span>

<span class="kd">function</span> <span class="nx">gearHash</span><span class="p">(</span><span class="nx">data</span><span class="p">:</span> <span class="nb">Uint8Array</span><span class="p">):</span> <span class="nx">bigint</span> <span class="p">{</span>
    <span class="kd">let</span> <span class="nx">hash</span> <span class="o">=</span> <span class="mi">0</span><span class="nx">n</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kd">const</span> <span class="nx">byte</span> <span class="k">of</span> <span class="nx">data</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">hash</span> <span class="o">=</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="nx">n</span><span class="p">)</span> <span class="o">+</span> <span class="nx">GEAR</span><span class="p">[</span><span class="nx">byte</span><span class="p">])</span> <span class="o">&amp;</span> <span class="mh">0xFFFFFFFFFFFFFFFF</span><span class="nx">n</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="nx">hash</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

  </div>
</div>

<style>
.code-tabs {
  margin: 1.5rem 0;
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-radius: 8px;
  overflow: hidden;
}

.code-tab-buttons {
  display: flex;
  background: rgba(61, 58, 54, 0.03);
  border-bottom: 1px solid rgba(61, 58, 54, 0.1);
}

.code-tab-btn {
  padding: 0.6rem 1.25rem;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  color: #8b7355;
  background: transparent;
  border: none;
  cursor: pointer;
  transition: all 0.15s ease;
}

.code-tab-btn:hover {
  color: #3d3a36;
  background: rgba(61, 58, 54, 0.05);
}

.code-tab-btn.active {
  color: #c45a3b;
  background: #fff;
  border-bottom: 2px solid #c45a3b;
  margin-bottom: -1px;
}

.code-tab-content {
  display: none;
}

.code-tab-content.active {
  display: block;
}

.code-tab-content pre {
  margin: 0;
  border-radius: 0;
}
</style>

<script>
document.addEventListener('DOMContentLoaded', () => {
  document.querySelectorAll('.code-tabs').forEach(container => {
    const buttons = container.querySelectorAll('.code-tab-btn');
    const contents = container.querySelectorAll('.code-tab-content');

    buttons.forEach(btn => {
      btn.addEventListener('click', () => {
        const lang = btn.dataset.lang;

        buttons.forEach(b => b.classList.remove('active'));
        contents.forEach(c => c.classList.remove('active'));

        btn.classList.add('active');
        container.querySelector(`.code-tab-content[data-lang="${lang}"]`).classList.add('active');
      });
    });
  });
});
</script>

<p>The simplicity of this hash is the point. A single left-shift and a single addition per byte gives the Gear hash its speed advantage over Rabin (which requires polynomial division) and Buzhash (which requires XOR with both the incoming and outgoing byte). But a fast hash is only half the story. The other half is deciding <em>when</em> the hash value signals a chunk boundary.</p>
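<p>One consequence of the shift-and-add update is worth seeing concretely: every byte's contribution moves one bit to the left on each step, so after 64 steps it has been shifted out of a 64-bit hash entirely. The hash therefore depends only on the most recent 64 bytes, without any explicit step that removes old bytes. The sketch below checks this. The <code class="language-plaintext highlighter-rouge">make_gear_table</code> and <code class="language-plaintext highlighter-rouge">gear_hash_tail</code> helpers are hypothetical names introduced for illustration, and the table is a generated stand-in, not the fastcdc-rs table.</p>

```rust
/// Stand-in for the GEAR table: 256 pseudo-random u64 values produced by a
/// splitmix64-style generator. Hypothetical helper, not the fastcdc-rs table.
fn make_gear_table(seed: u64) -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state = seed;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Shift-and-add Gear hash over a whole slice.
fn gear_hash(gear: &[u64; 256], data: &[u8]) -> u64 {
    data.iter()
        .fold(0u64, |hash, &byte| (hash << 1).wrapping_add(gear[byte as usize]))
}

/// Gear hash over only the trailing `window` bytes (or the whole slice if it
/// is shorter than `window`).
fn gear_hash_tail(gear: &[u64; 256], data: &[u8], window: usize) -> u64 {
    let start = data.len().saturating_sub(window);
    gear_hash(gear, &data[start..])
}
```

<p>Hashing a long buffer and hashing just its final 64 bytes give the same value, which is why two files that differ only in earlier content resynchronize on the same boundaries once they share 64 bytes.</p>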

<h3 id="finding-chunk-boundaries">Finding Chunk Boundaries</h3>

<p>With the Gear hash updating for each byte, how do we decide where to cut?</p>

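<p>The classic approach: check if the low N bits of the hash are zero. If we want an average chunk size of 8 KB, we check if <code class="language-plaintext highlighter-rouge">(hash &amp; 0x1FFF) == 0</code> (the low 13 bits). The parentheses matter: in C, JavaScript, and TypeScript, <code class="language-plaintext highlighter-rouge">==</code> binds more tightly than <code class="language-plaintext highlighter-rouge">&amp;</code>, so the unparenthesized form would compare the mask against zero first.</p>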

<p>Why does this work? The Gear hash produces pseudo-random values, so the probability that N chosen bits are all zero is $1/2^N$. For 13 bits, that’s $1/2^{13} = 1/8192$, meaning each byte position has a 1-in-8,192 chance of triggering a boundary, and boundaries land about 8,192 bytes apart on average. The mask <em>is</em> the chunk size control: more bits mean larger average chunks, fewer bits mean smaller ones.</p>

<p>This is the heart of every BSW algorithm. The algorithm doesn’t search for patterns in the content directly. Instead, it feeds each byte through the Gear table, lets the rolling hash mix the values together, and checks whether certain bits of the result happen to be zero. The content determines the hash, the mask selects which bits to check, and a boundary is placed wherever those bits are all zero.</p>
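<p>As a minimal sketch of that single-mask check, the function below walks a buffer, updates the Gear hash per byte, and stops at the first position where the masked bits are all zero. It deliberately has no minimum or maximum size. The names <code class="language-plaintext highlighter-rouge">low_bits_mask</code>, <code class="language-plaintext highlighter-rouge">find_boundary_basic</code>, and <code class="language-plaintext highlighter-rouge">make_gear_table</code> are assumptions made for this example, and the table is a generated stand-in.</p>

```rust
/// Stand-in for the GEAR table (splitmix64-style generator). Hypothetical
/// helper, not the fastcdc-rs table.
fn make_gear_table(seed: u64) -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state = seed;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Mask selecting the low `bits` bits (requires `bits < 64`).
/// For example, 13 bits gives 0x1FFF.
fn low_bits_mask(bits: u32) -> u64 {
    (1u64 << bits) - 1
}

/// Returns the offset of the first byte at which all masked hash bits are
/// zero, or `None` if no position in `data` qualifies.
fn find_boundary_basic(gear: &[u64; 256], data: &[u8], mask: u64) -> Option<usize> {
    let mut hash: u64 = 0;
    for (position, &byte) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        if (hash & mask) == 0 {
            return Some(position);
        }
    }
    None
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">find_boundary_basic(&amp;gear, data, low_bits_mask(13))</code> corresponds to the 8 KB example: the content decides the hash, and the mask decides how often a hash qualifies.</p>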

<p>But basic masking has a problem, and FastCDC does something more clever: <strong>normalized chunking</strong> with dual masks.</p>

<p>The problem with basic masking is that chunk sizes follow an exponential distribution (strictly speaking its discrete counterpart, the geometric distribution). You get many small chunks and occasional very large ones. This hurts deduplication because small chunks increase metadata overhead, and large chunks reduce sharing opportunities.</p>

<div class="cdc-viz">
  <div class="cdc-viz-header">
    <span class="cdc-viz-title">Basic vs Normalized Chunk Size Distribution</span>
  </div>
  <div class="dist-illustration">
    <div class="dist-illustration-panel">
      <div class="comparison-label"><span class="comparison-label-text">Basic CDC <span class="comparison-sublabel">(Single Mask)</span></span></div>
      <svg class="dist-illustration-svg" viewBox="0 -8 200 118" preserveAspectRatio="xMidYMid meet">
        <defs>
          <linearGradient id="exp-fill" x1="0" y1="0" x2="0" y2="1">
            <stop offset="0%" stop-color="#c45a3b" stop-opacity="0.25" />
            <stop offset="100%" stop-color="#c45a3b" stop-opacity="0.05" />
          </linearGradient>
        </defs>
        <path d="M 0 100 C 5 100, 10 100, 15 98 C 20 90, 22 30, 25 12 C 28 25, 40 50, 60 68 C 80 78, 110 88, 150 94 C 170 96, 190 98, 200 99 L 200 100 Z" fill="url(#exp-fill)" stroke="#c45a3b" stroke-width="1.5" stroke-linejoin="round" />
        <line x1="100" y1="5" x2="100" y2="100" stroke="#3d3a36" stroke-width="1" stroke-dasharray="4 3" opacity="0.5" />
        <text x="100" y="2" text-anchor="middle" font-family="'Libre Baskerville', Georgia, serif" font-size="6" fill="#8b7355">target avg</text>
        <text x="100" y="109" text-anchor="middle" font-family="'Libre Baskerville', Georgia, serif" font-size="6" fill="#8b7355">Chunk size &#x2192;</text>
      </svg>
      <div class="dist-illustration-note">Many small chunks, long tail of large ones</div>
    </div>
    <div class="dist-illustration-panel">
      <div class="comparison-label"><span class="comparison-label-text">Normalized CDC <span class="comparison-sublabel">(Dual Mask)</span></span></div>
      <svg class="dist-illustration-svg" viewBox="0 -8 200 118" preserveAspectRatio="xMidYMid meet">
        <defs>
          <linearGradient id="gauss-fill" x1="0" y1="0" x2="0" y2="1">
            <stop offset="0%" stop-color="#c45a3b" stop-opacity="0.25" />
            <stop offset="100%" stop-color="#c45a3b" stop-opacity="0.05" />
          </linearGradient>
        </defs>
        <path d="M 0 100 C 10 100, 25 99, 40 97 C 55 93, 65 80, 75 55 C 85 28, 90 14, 100 10 C 110 14, 115 28, 125 55 C 135 80, 145 93, 160 97 C 175 99, 190 100, 200 100 Z" fill="url(#gauss-fill)" stroke="#c45a3b" stroke-width="1.5" stroke-linejoin="round" />
        <line x1="100" y1="5" x2="100" y2="100" stroke="#3d3a36" stroke-width="1" stroke-dasharray="4 3" opacity="0.5" />
        <text x="100" y="2" text-anchor="middle" font-family="'Libre Baskerville', Georgia, serif" font-size="6" fill="#8b7355">target avg</text>
        <text x="100" y="109" text-anchor="middle" font-family="'Libre Baskerville', Georgia, serif" font-size="6" fill="#8b7355">Chunk size &#x2192;</text>
      </svg>
      <div class="dist-illustration-note">Chunks cluster tightly around the target</div>
    </div>
  </div>
</div>

<p>Why exponential? Because the probability of hitting a boundary is the same at every byte position. Each byte has a $1/2^N$ chance of triggering a cut, regardless of how many bytes have already been consumed into the current chunk. Short chunks are always more likely than long ones, for the same reason that flipping heads on the first try is more likely than waiting until the tenth.</p>
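<p>To put rough numbers on that skew, here is a small sketch that evaluates the chance of a chunk ending within the first <code class="language-plaintext highlighter-rouge">k</code> bytes, assuming every byte independently triggers a boundary with probability $1/2^N$ and that no minimum or maximum size is enforced. The function name is hypothetical.</p>

```rust
/// Probability that a chunk ends within its first `k` bytes when every byte
/// independently triggers a boundary with probability 1 / 2^bits.
/// Assumes no minimum or maximum chunk size is enforced.
fn prob_chunk_at_most(bits: u32, k: u64) -> f64 {
    let p = 1.0 / (1u64 << bits) as f64;
    // Chance that none of the first k bytes cuts is (1 - p)^k.
    1.0 - (1.0 - p).powf(k as f64)
}
```

<p>With a 13-bit mask, roughly 39% of chunks end before reaching half of the 8,192-byte target (<code class="language-plaintext highlighter-rouge">prob_chunk_at_most(13, 4096)</code>), while about 13.5% run past twice the target (<code class="language-plaintext highlighter-rouge">1.0 - prob_chunk_at_most(13, 16384)</code>).</p>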

<p>The fix is intuitive: vary the probability based on how many bytes have been consumed into the current chunk so far. If only a few bytes have been consumed, make boundaries <em>harder</em> to find so you don’t produce tiny chunks. Once you’ve consumed past the target average, make boundaries <em>easier</em> to find so chunks don’t grow too large.</p>

<p>FastCDC’s solution:</p>
<ol>
  <li><strong>Below the average size</strong>: Use a <strong>stricter mask</strong> (more bits must be zero)</li>
  <li><strong>Above the average size</strong>: Use a <strong>looser mask</strong> (fewer bits must be zero)</li>
</ol>
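<p>The two rules above can be sketched as a pair of masks plus a position-based selector. This is an illustration under stated assumptions: it uses contiguous low-bit masks for readability, whereas the <code class="language-plaintext highlighter-rouge">MASKS</code> table referenced in the listings may place its bits differently, and <code class="language-plaintext highlighter-rouge">dual_masks</code> and <code class="language-plaintext highlighter-rouge">mask_for_position</code> are hypothetical names.</p>

```rust
/// Returns `(strict, loose)` masks for a power-of-two `avg_size`.
/// `level` is how many bits to add to (strict) or remove from (loose) the
/// base mask; it must not exceed log2(avg_size).
/// Assumption: contiguous low-bit masks, chosen here for readability.
fn dual_masks(avg_size: u64, level: u32) -> (u64, u64) {
    assert!(avg_size.is_power_of_two());
    let bits = avg_size.trailing_zeros(); // log2 for a power of two
    let strict = (1u64 << (bits + level)) - 1;
    let loose = (1u64 << (bits - level)) - 1;
    (strict, loose)
}

/// Mask in force after `consumed` bytes of the current chunk: strict below
/// the average size, loose at or above it.
fn mask_for_position(consumed: u64, avg_size: u64, strict: u64, loose: u64) -> u64 {
    if consumed < avg_size {
        strict
    } else {
        loose
    }
}
```

<p>For an 8 KB average and one bit of adjustment, the strict mask checks 14 bits (a 1-in-16,384 chance per byte) and the loose mask checks 12 bits (1-in-4,096), so boundaries are half as likely before the target and twice as likely after it.</p>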

<p>FastCDC was published in two rounds: a 2016 paper that introduced normalized chunking, and a 2020 revision that added a performance optimization by processing two bytes per iteration. Let’s walk through both versions to see how the dual-mask idea translates into concrete code, and how the algorithm evolved between the two papers.</p>

<h3 id="the-2016-algorithm">The 2016 Algorithm</h3>

<p>This is the version illustrated above: it processes one byte at a time, shifting and adding through the Gear table exactly as we saw in the <em>Gear Hash in Action</em> animation, and switches between the strict and loose mask depending on how many bytes of the current chunk it has consumed. Here’s the complete loop:</p>

<div class="code-tabs" id="fastcdc-2016-code">
  <div class="code-tab-buttons">
    <button class="code-tab-btn active" data-lang="pseudocode">Pseudocode</button>
    <button class="code-tab-btn" data-lang="rust">Rust</button>
    <button class="code-tab-btn" data-lang="typescript">TypeScript</button>
  </div>

  <div class="code-tab-content active" data-lang="pseudocode">

<figure class="highlight"><pre><code class="language-text" data-lang="text"><table class="rouge-table"><tbody><tr><td class="code"><pre>function find_chunk_boundary(data, min_size, avg_size, max_size):
    // Skip to minimum size (cut-point skipping)
    position = min_size
    hash = 0

    // Calculate masks based on average size
    bits = log2(avg_size)
    mask_s = MASKS[bits + 1]  // Stricter (1 more bit)
    mask_l = MASKS[bits - 1]  // Looser  (1 fewer bit)

    // Phase 1: Strict mask until average size
    while position &lt; avg_size AND position &lt; data.length:
        hash = (hash &lt;&lt; 1) + GEAR[data[position]]
        if (hash &amp; mask_s) == 0:
            return position  // Found boundary!
        position += 1

    // Phase 2: Loose mask until maximum size
    while position &lt; max_size AND position &lt; data.length:
        hash = (hash &lt;&lt; 1) + GEAR[data[position]]
        if (hash &amp; mask_l) == 0:
            return position  // Found boundary!
        position += 1

    // Hit maximum size without finding boundary
    return min(max_size, data.length)
</pre></td></tr></tbody></table></code></pre></figure>

  </div>

  <div class="code-tab-content" data-lang="rust">

<figure class="highlight"><pre><code class="language-rust" data-lang="rust"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="c1">// From fastcdc-rs v2016 module</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">cut</span><span class="p">(</span>
    <span class="n">source</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">u8</span><span class="p">],</span>
    <span class="n">min_size</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span>
    <span class="n">avg_size</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span>
    <span class="n">max_size</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span>
    <span class="n">mask_s</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
    <span class="n">mask_l</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
<span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">usize</span> <span class="p">{</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">remaining</span> <span class="o">=</span> <span class="n">source</span><span class="nf">.len</span><span class="p">();</span>
    <span class="k">if</span> <span class="n">remaining</span> <span class="o">&lt;=</span> <span class="n">min_size</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">remaining</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">let</span> <span class="k">mut</span> <span class="n">center</span> <span class="o">=</span> <span class="n">avg_size</span><span class="p">;</span>
    <span class="k">if</span> <span class="n">remaining</span> <span class="o">&gt;</span> <span class="n">max_size</span> <span class="p">{</span>
        <span class="n">remaining</span> <span class="o">=</span> <span class="n">max_size</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">remaining</span> <span class="o">&lt;</span> <span class="n">center</span> <span class="p">{</span>
        <span class="n">center</span> <span class="o">=</span> <span class="n">remaining</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">let</span> <span class="k">mut</span> <span class="n">index</span> <span class="o">=</span> <span class="n">min_size</span><span class="p">;</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">hash</span><span class="p">:</span> <span class="nb">u64</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="c1">// Phase 1: strict mask until center (average size)</span>
    <span class="k">while</span> <span class="n">index</span> <span class="o">&lt;</span> <span class="n">center</span> <span class="p">{</span>
        <span class="n">hash</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="p">)</span><span class="nf">.wrapping_add</span><span class="p">(</span><span class="n">GEAR</span><span class="p">[</span><span class="n">source</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&amp;</span> <span class="n">mask_s</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">index</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Phase 2: loose mask until max_size</span>
    <span class="k">while</span> <span class="n">index</span> <span class="o">&lt;</span> <span class="n">remaining</span> <span class="p">{</span>
        <span class="n">hash</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="p">)</span><span class="nf">.wrapping_add</span><span class="p">(</span><span class="n">GEAR</span><span class="p">[</span><span class="n">source</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&amp;</span> <span class="n">mask_l</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">index</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">remaining</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

  </div>

  <div class="code-tab-content" data-lang="typescript">

<figure class="highlight"><pre><code class="language-typescript" data-lang="typescript"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kd">function</span> <span class="nx">findChunkBoundary</span><span class="p">(</span>
    <span class="nx">data</span><span class="p">:</span> <span class="nb">Uint8Array</span><span class="p">,</span>
    <span class="nx">minSize</span><span class="p">:</span> <span class="kr">number</span><span class="p">,</span>
    <span class="nx">avgSize</span><span class="p">:</span> <span class="kr">number</span><span class="p">,</span>
    <span class="nx">maxSize</span><span class="p">:</span> <span class="kr">number</span><span class="p">,</span>
<span class="p">):</span> <span class="kr">number</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">length</span> <span class="o">&lt;=</span> <span class="nx">minSize</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="nx">data</span><span class="p">.</span><span class="nx">length</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="kd">const</span> <span class="nx">bits</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">floor</span><span class="p">(</span><span class="nb">Math</span><span class="p">.</span><span class="nx">log2</span><span class="p">(</span><span class="nx">avgSize</span><span class="p">));</span>
    <span class="kd">const</span> <span class="nx">maskS</span> <span class="o">=</span> <span class="nx">MASKS</span><span class="p">[</span><span class="nx">bits</span> <span class="o">+</span> <span class="mi">1</span><span class="p">];</span> <span class="c1">// Stricter</span>
    <span class="kd">const</span> <span class="nx">maskL</span> <span class="o">=</span> <span class="nx">MASKS</span><span class="p">[</span><span class="nx">bits</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span> <span class="c1">// Looser</span>

    <span class="kd">let</span> <span class="nx">remaining</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">min</span><span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">length</span><span class="p">,</span> <span class="nx">maxSize</span><span class="p">);</span>
    <span class="kd">let</span> <span class="nx">center</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">min</span><span class="p">(</span><span class="nx">avgSize</span><span class="p">,</span> <span class="nx">remaining</span><span class="p">);</span>
    <span class="kd">let</span> <span class="nx">index</span> <span class="o">=</span> <span class="nx">minSize</span><span class="p">;</span>
    <span class="kd">let</span> <span class="nx">hash</span> <span class="o">=</span> <span class="mi">0</span><span class="nx">n</span><span class="p">;</span>

    <span class="c1">// Phase 1: strict mask until center</span>
    <span class="k">while</span> <span class="p">(</span><span class="nx">index</span> <span class="o">&lt;</span> <span class="nx">center</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">hash</span> <span class="o">=</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="nx">n</span><span class="p">)</span> <span class="o">+</span> <span class="nx">GEAR</span><span class="p">[</span><span class="nx">data</span><span class="p">[</span><span class="nx">index</span><span class="p">]])</span> <span class="o">&amp;</span> <span class="nx">MASK_64</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&amp;</span> <span class="nx">maskS</span><span class="p">)</span> <span class="o">===</span> <span class="mi">0</span><span class="nx">n</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="nx">index</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="nx">index</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Phase 2: loose mask until max</span>
    <span class="k">while</span> <span class="p">(</span><span class="nx">index</span> <span class="o">&lt;</span> <span class="nx">remaining</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">hash</span> <span class="o">=</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="nx">n</span><span class="p">)</span> <span class="o">+</span> <span class="nx">GEAR</span><span class="p">[</span><span class="nx">data</span><span class="p">[</span><span class="nx">index</span><span class="p">]])</span> <span class="o">&amp;</span> <span class="nx">MASK_64</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&amp;</span> <span class="nx">maskL</span><span class="p">)</span> <span class="o">===</span> <span class="mi">0</span><span class="nx">n</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="nx">index</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="nx">index</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="nx">remaining</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

  </div>
</div>
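<p>The loop above finds a single cut point. To chunk a whole buffer, a caller applies it repeatedly to whatever data remains. The sketch below shows one way that driver might look. It restates the two-phase search in compact form so the example is self-contained. The <code class="language-plaintext highlighter-rouge">chunk_lengths</code> driver, the generated Gear table, and the contiguous masks used in the usage note are assumptions made for illustration.</p>

```rust
/// Stand-in for the GEAR table (splitmix64-style generator). Hypothetical
/// helper, not the fastcdc-rs table.
fn make_gear_table(seed: u64) -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state = seed;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Two-phase cut point search: strict mask below the average size, loose
/// mask from there up to the maximum size.
fn cut(
    gear: &[u64; 256],
    source: &[u8],
    min_size: usize,
    avg_size: usize,
    max_size: usize,
    mask_s: u64,
    mask_l: u64,
) -> usize {
    if source.len() <= min_size {
        return source.len();
    }
    let remaining = source.len().min(max_size);
    let center = avg_size.min(remaining);
    let mut hash: u64 = 0;
    let mut index = min_size; // cut-point skipping: start at the minimum size
    while index < remaining {
        hash = (hash << 1).wrapping_add(gear[source[index] as usize]);
        let mask = if index < center { mask_s } else { mask_l };
        if (hash & mask) == 0 {
            return index;
        }
        index += 1;
    }
    remaining
}

/// Splits `data` into chunk lengths by calling `cut` on the unprocessed tail.
/// Requires `min_size > 0` so that every chunk makes progress.
fn chunk_lengths(
    gear: &[u64; 256],
    data: &[u8],
    min_size: usize,
    avg_size: usize,
    max_size: usize,
    mask_s: u64,
    mask_l: u64,
) -> Vec<usize> {
    assert!(min_size > 0);
    let mut lengths = Vec::new();
    let mut offset = 0;
    while offset < data.len() {
        let len = cut(gear, &data[offset..], min_size, avg_size, max_size, mask_s, mask_l);
        lengths.push(len);
        offset += len;
    }
    lengths
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">chunk_lengths(&amp;gear, &amp;data, 2048, 8192, 32768, 0x3FFF, 0x0FFF)</code> returns lengths that sum to <code class="language-plaintext highlighter-rouge">data.len()</code>. Every chunk except possibly the last falls between the minimum and maximum size.</p>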

<p>The 2016 algorithm’s one-byte sliding window is already fast, but the 2020 revision found a way to cut the number of loop iterations in half.</p>

<h3 id="the-2020-enhancement-rolling-two-bytes">The 2020 Enhancement: Rolling Two Bytes</h3>

<p>The 2020 paper introduces a simple optimization: <strong>process two adjacent bytes per step</strong> as the sliding window moves across the data. Since two consecutive single-bit shifts are equivalent to one two-bit shift, the hash updates for two adjacent bytes can be collapsed:</p>

<p>Instead of two separate steps:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hash = (hash &lt;&lt; 1) + GEAR[byte1]     // step 1
hash = (hash &lt;&lt; 1) + GEAR[byte2]     // step 2
</code></pre></div></div>

<p>Collapse into one:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hash = (hash &lt;&lt; 2) + GEAR_LS[byte1]  // Left-shifted GEAR table
hash = hash + GEAR[byte2]            // Regular GEAR table
</code></pre></div></div>

<p>The boundary check still happens after each byte, but each check uses a different mask. After the first byte, the hash bits sit one position higher than they would in the 2016 algorithm, so a left-shifted mask (<code class="language-plaintext highlighter-rouge">mask_s_ls</code>) is needed to check the equivalent bits. After the second byte, the bits realign with the original positions, so the regular mask (<code class="language-plaintext highlighter-rouge">mask_s</code>) is used. The results are identical to processing one byte at a time, provided the mask leaves the top bit of the hash unused, because the extra shift pushes that bit out of the 64-bit word.</p>
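<p>The equivalence can be checked directly. The sketch below performs the same two-byte update both ways and returns the hash after each byte, so the intermediate and final values can be compared. The function names and the generated stand-in table are assumptions made for this example.</p>

```rust
/// Stand-in for the GEAR table (splitmix64-style generator). Hypothetical
/// helper, not the fastcdc-rs table.
fn make_gear_table(seed: u64) -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state = seed;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Builds the left-shifted table: every entry shifted left by one bit.
fn shift_table(gear: &[u64; 256]) -> [u64; 256] {
    let mut shifted = [0u64; 256];
    for (dst, &src) in shifted.iter_mut().zip(gear.iter()) {
        *dst = src << 1;
    }
    shifted
}

/// Two one-byte updates; returns the hash after each byte.
fn step_one_byte_twice(gear: &[u64; 256], hash: u64, b1: u8, b2: u8) -> (u64, u64) {
    let h1 = (hash << 1).wrapping_add(gear[b1 as usize]);
    let h2 = (h1 << 1).wrapping_add(gear[b2 as usize]);
    (h1, h2)
}

/// One collapsed two-byte update; returns the hash after each byte.
fn step_two_bytes(
    gear: &[u64; 256],
    gear_ls: &[u64; 256],
    hash: u64,
    b1: u8,
    b2: u8,
) -> (u64, u64) {
    let first = (hash << 2).wrapping_add(gear_ls[b1 as usize]);
    let second = first.wrapping_add(gear[b2 as usize]);
    (first, second)
}
```

<p>After the first byte, the collapsed hash equals the one-byte hash shifted left by one, which is why the check needs <code class="language-plaintext highlighter-rouge">mask &lt;&lt; 1</code>. After the second byte, the two hashes are equal and the regular mask applies.</p>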

<p>Where does the speedup come from? Each time the sliding window advances one step, the CPU must increment the position, evaluate whether the window has reached a size limit, and branch back to process the next position. By processing two bytes per step, this bookkeeping happens half as often. The two single-bit shifts also collapse into one two-bit shift instruction. These savings are small per step, but across millions of bytes they add up.</p>

<div class="code-tabs" id="fastcdc-2020-code">
  <div class="code-tab-buttons">
    <button class="code-tab-btn active" data-lang="pseudocode">Pseudocode</button>
    <button class="code-tab-btn" data-lang="rust">Rust</button>
    <button class="code-tab-btn" data-lang="typescript">TypeScript</button>
  </div>

  <div class="code-tab-content active" data-lang="pseudocode">

<figure class="highlight"><pre><code class="language-text" data-lang="text"><table class="rouge-table"><tbody><tr><td class="code"><pre>// Pre-compute left-shifted GEAR table
GEAR_LS[i] = GEAR[i] &lt;&lt; 1  // for all i

function find_chunk_boundary_2020(data, min_size, avg_size, max_size):
    position = min_size / 2  // Start at half (we process 2 bytes/iter)
    hash = 0

    // Phase 1: Strict mask, two bytes at a time
    while position &lt; avg_size / 2:
        byte_pos = position * 2

        // First byte: shift by 2, add left-shifted value
        hash = (hash &lt;&lt; 2) + GEAR_LS[data[byte_pos]]
        if (hash &amp; mask_s_ls) == 0:
            return byte_pos

        // Second byte: add regular value
        hash = hash + GEAR[data[byte_pos + 1]]
        if (hash &amp; mask_s) == 0:
            return byte_pos + 1

        position += 1

    // Phase 2: Loose mask (similar structure)
    // ... same pattern with mask_l
</pre></td></tr></tbody></table></code></pre></figure>

  </div>

  <div class="code-tab-content" data-lang="rust">

<figure class="highlight"><pre><code class="language-rust" data-lang="rust"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="c1">// From fastcdc-rs v2020 module</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">cut_gear</span><span class="p">(</span>
    <span class="n">source</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">u8</span><span class="p">],</span>
    <span class="n">min_size</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span>
    <span class="n">avg_size</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span>
    <span class="n">max_size</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span>
    <span class="n">mask_s</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
    <span class="n">mask_l</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
    <span class="n">mask_s_ls</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>        <span class="c1">// Left-shifted strict mask</span>
    <span class="n">mask_l_ls</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>        <span class="c1">// Left-shifted loose mask</span>
    <span class="n">gear</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">u64</span><span class="p">;</span> <span class="mi">256</span><span class="p">],</span>
    <span class="n">gear_ls</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">u64</span><span class="p">;</span> <span class="mi">256</span><span class="p">],</span>  <span class="c1">// Left-shifted GEAR table</span>
<span class="p">)</span> <span class="k">-&gt;</span> <span class="p">(</span><span class="nb">u64</span><span class="p">,</span> <span class="nb">usize</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">remaining</span> <span class="o">=</span> <span class="n">source</span><span class="nf">.len</span><span class="p">();</span>
    <span class="k">if</span> <span class="n">remaining</span> <span class="o">&lt;=</span> <span class="n">min_size</span> <span class="p">{</span>
        <span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">remaining</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="k">let</span> <span class="k">mut</span> <span class="n">center</span> <span class="o">=</span> <span class="n">avg_size</span><span class="p">;</span>
    <span class="k">if</span> <span class="n">remaining</span> <span class="o">&gt;</span> <span class="n">max_size</span> <span class="p">{</span>
        <span class="n">remaining</span> <span class="o">=</span> <span class="n">max_size</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">remaining</span> <span class="o">&lt;</span> <span class="n">center</span> <span class="p">{</span>
        <span class="n">center</span> <span class="o">=</span> <span class="n">remaining</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">let</span> <span class="k">mut</span> <span class="n">index</span> <span class="o">=</span> <span class="n">min_size</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">hash</span><span class="p">:</span> <span class="nb">u64</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="c1">// Phase 1: strict mask, two bytes per iteration</span>
    <span class="k">while</span> <span class="n">index</span> <span class="o">&lt;</span> <span class="n">center</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">a</span> <span class="o">=</span> <span class="n">index</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>

        <span class="c1">// First byte</span>
        <span class="n">hash</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">2</span><span class="p">)</span><span class="nf">.wrapping_add</span><span class="p">(</span><span class="n">gear_ls</span><span class="p">[</span><span class="n">source</span><span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&amp;</span> <span class="n">mask_s_ls</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">hash</span><span class="p">,</span> <span class="n">a</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="c1">// Second byte</span>
        <span class="n">hash</span> <span class="o">=</span> <span class="n">hash</span><span class="nf">.wrapping_add</span><span class="p">(</span><span class="n">gear</span><span class="p">[</span><span class="n">source</span><span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&amp;</span> <span class="n">mask_s</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">hash</span><span class="p">,</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Phase 2: loose mask (same pattern)</span>
    <span class="k">while</span> <span class="n">index</span> <span class="o">&lt;</span> <span class="n">remaining</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">a</span> <span class="o">=</span> <span class="n">index</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="n">hash</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">2</span><span class="p">)</span><span class="nf">.wrapping_add</span><span class="p">(</span><span class="n">gear_ls</span><span class="p">[</span><span class="n">source</span><span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&amp;</span> <span class="n">mask_l_ls</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">hash</span><span class="p">,</span> <span class="n">a</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="n">hash</span> <span class="o">=</span> <span class="n">hash</span><span class="nf">.wrapping_add</span><span class="p">(</span><span class="n">gear</span><span class="p">[</span><span class="n">source</span><span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&amp;</span> <span class="n">mask_l</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">hash</span><span class="p">,</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="p">(</span><span class="n">hash</span><span class="p">,</span> <span class="n">remaining</span><span class="p">)</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

  </div>

  <div class="code-tab-content" data-lang="typescript">

<figure class="highlight"><pre><code class="language-typescript" data-lang="typescript"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
</pre></td><td class="code"><pre><span class="c1">// Pre-computed left-shifted table</span>
<span class="kd">const</span> <span class="nx">GEAR_LS</span><span class="p">:</span> <span class="nx">bigint</span><span class="p">[]</span> <span class="o">=</span> <span class="nx">GEAR</span><span class="p">.</span><span class="nx">map</span><span class="p">(</span><span class="nx">g</span> <span class="o">=&gt;</span> <span class="p">(</span><span class="nx">g</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="nx">n</span><span class="p">)</span> <span class="o">&amp;</span> <span class="nx">MASK_64</span><span class="p">);</span>

<span class="kd">function</span> <span class="nx">findChunkBoundary2020</span><span class="p">(</span>
    <span class="nx">data</span><span class="p">:</span> <span class="nb">Uint8Array</span><span class="p">,</span>
    <span class="nx">minSize</span><span class="p">:</span> <span class="kr">number</span><span class="p">,</span>
    <span class="nx">avgSize</span><span class="p">:</span> <span class="kr">number</span><span class="p">,</span>
    <span class="nx">maxSize</span><span class="p">:</span> <span class="kr">number</span><span class="p">,</span>
<span class="p">):</span> <span class="kr">number</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">length</span> <span class="o">&lt;=</span> <span class="nx">minSize</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="nx">data</span><span class="p">.</span><span class="nx">length</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="kd">const</span> <span class="nx">bits</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">floor</span><span class="p">(</span><span class="nb">Math</span><span class="p">.</span><span class="nx">log2</span><span class="p">(</span><span class="nx">avgSize</span><span class="p">));</span>
    <span class="kd">const</span> <span class="nx">maskS</span> <span class="o">=</span> <span class="nx">MASKS</span><span class="p">[</span><span class="nx">bits</span> <span class="o">+</span> <span class="mi">1</span><span class="p">];</span>
    <span class="kd">const</span> <span class="nx">maskL</span> <span class="o">=</span> <span class="nx">MASKS</span><span class="p">[</span><span class="nx">bits</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
    <span class="kd">const</span> <span class="nx">maskSLs</span> <span class="o">=</span> <span class="nx">maskS</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="nx">n</span><span class="p">;</span>  <span class="c1">// Left-shifted masks</span>
    <span class="kd">const</span> <span class="nx">maskLLs</span> <span class="o">=</span> <span class="nx">maskL</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="nx">n</span><span class="p">;</span>

    <span class="kd">let</span> <span class="nx">remaining</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">min</span><span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">length</span><span class="p">,</span> <span class="nx">maxSize</span><span class="p">);</span>
    <span class="kd">let</span> <span class="nx">center</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">min</span><span class="p">(</span><span class="nx">avgSize</span><span class="p">,</span> <span class="nx">remaining</span><span class="p">);</span>
    <span class="kd">let</span> <span class="nx">index</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">floor</span><span class="p">(</span><span class="nx">minSize</span> <span class="o">/</span> <span class="mi">2</span><span class="p">);</span>
    <span class="kd">let</span> <span class="nx">hash</span> <span class="o">=</span> <span class="mi">0</span><span class="nx">n</span><span class="p">;</span>

    <span class="c1">// Phase 1: strict mask, two bytes at a time</span>
    <span class="k">while</span> <span class="p">(</span><span class="nx">index</span> <span class="o">&lt;</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">floor</span><span class="p">(</span><span class="nx">center</span> <span class="o">/</span> <span class="mi">2</span><span class="p">))</span> <span class="p">{</span>
        <span class="kd">const</span> <span class="nx">a</span> <span class="o">=</span> <span class="nx">index</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>

        <span class="c1">// First byte</span>
        <span class="nx">hash</span> <span class="o">=</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">2</span><span class="nx">n</span><span class="p">)</span> <span class="o">+</span> <span class="nx">GEAR_LS</span><span class="p">[</span><span class="nx">data</span><span class="p">[</span><span class="nx">a</span><span class="p">]])</span> <span class="o">&amp;</span> <span class="nx">MASK_64</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&amp;</span> <span class="nx">maskSLs</span><span class="p">)</span> <span class="o">===</span> <span class="mi">0</span><span class="nx">n</span><span class="p">)</span> <span class="k">return</span> <span class="nx">a</span><span class="p">;</span>

        <span class="c1">// Second byte</span>
        <span class="nx">hash</span> <span class="o">=</span> <span class="p">(</span><span class="nx">hash</span> <span class="o">+</span> <span class="nx">GEAR</span><span class="p">[</span><span class="nx">data</span><span class="p">[</span><span class="nx">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]])</span> <span class="o">&amp;</span> <span class="nx">MASK_64</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&amp;</span> <span class="nx">maskS</span><span class="p">)</span> <span class="o">===</span> <span class="mi">0</span><span class="nx">n</span><span class="p">)</span> <span class="k">return</span> <span class="nx">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>

        <span class="nx">index</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Phase 2: loose mask</span>
    <span class="k">while</span> <span class="p">(</span><span class="nx">index</span> <span class="o">&lt;</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">floor</span><span class="p">(</span><span class="nx">remaining</span> <span class="o">/</span> <span class="mi">2</span><span class="p">))</span> <span class="p">{</span>
        <span class="kd">const</span> <span class="nx">a</span> <span class="o">=</span> <span class="nx">index</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="nx">hash</span> <span class="o">=</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">2</span><span class="nx">n</span><span class="p">)</span> <span class="o">+</span> <span class="nx">GEAR_LS</span><span class="p">[</span><span class="nx">data</span><span class="p">[</span><span class="nx">a</span><span class="p">]])</span> <span class="o">&amp;</span> <span class="nx">MASK_64</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&amp;</span> <span class="nx">maskLLs</span><span class="p">)</span> <span class="o">===</span> <span class="mi">0</span><span class="nx">n</span><span class="p">)</span> <span class="k">return</span> <span class="nx">a</span><span class="p">;</span>
        <span class="nx">hash</span> <span class="o">=</span> <span class="p">(</span><span class="nx">hash</span> <span class="o">+</span> <span class="nx">GEAR</span><span class="p">[</span><span class="nx">data</span><span class="p">[</span><span class="nx">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]])</span> <span class="o">&amp;</span> <span class="nx">MASK_64</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">((</span><span class="nx">hash</span> <span class="o">&amp;</span> <span class="nx">maskL</span><span class="p">)</span> <span class="o">===</span> <span class="mi">0</span><span class="nx">n</span><span class="p">)</span> <span class="k">return</span> <span class="nx">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="nx">index</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="nx">remaining</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

  </div>
</div>

<div class="cdc-callout" data-label="Performance Note">
The 2020 optimization increases chunking throughput by 30-40% over the 2016 version <span class="cdc-cite"><a href="#ref-6">[6]</a></span>. In practice, CDC is rarely the bottleneck in a deduplication pipeline. Reading data from disk and computing cryptographic hashes for each chunk (used to identify duplicates) are typically the slower steps.
</div>
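
<p>The two-byte loop relies on a small identity: shifting the hash by two and adding a pre-shifted table entry gives exactly the one-byte hash shifted left by one, so testing it against a left-shifted mask is the same test. The sketch below checks that identity in plain Rust. It is an illustration rather than the listing above: <code>splitmix64</code>, <code>demo_gear</code>, <code>first_cut_one_byte</code>, and <code>first_cut_two_bytes</code> are hypothetical helpers, the table is filled from a seeded generator instead of the real Gear table, and the mask is assumed to leave bit 63 clear so that shifting it loses nothing.</p>

```rust
/// SplitMix64 step: a small seeded generator used here only to fill a demo table.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical stand-in for the Gear table: 256 pseudo-random 64-bit values.
fn demo_gear(seed: &mut u64) -> [u64; 256] {
    let mut table = [0u64; 256];
    for entry in table.iter_mut() {
        *entry = splitmix64(seed);
    }
    table
}

/// One byte per iteration: index of the first byte whose hash matches the mask.
fn first_cut_one_byte(data: &[u8], gear: &[u64; 256], mask: u64) -> Option<usize> {
    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        if (hash & mask) == 0 {
            return Some(i);
        }
    }
    None
}

/// Two bytes per iteration. Assumes an even-length input and a mask with bit 63 clear.
fn first_cut_two_bytes(data: &[u8], gear: &[u64; 256], mask: u64) -> Option<usize> {
    // Pre-shifted copy of the table and pre-shifted mask.
    let mut gear_ls = *gear;
    for entry in gear_ls.iter_mut() {
        *entry <<= 1;
    }
    let mask_ls = mask << 1;

    let mut hash: u64 = 0;
    for pair in 0..data.len() / 2 {
        let a = pair * 2;
        // First byte: this value equals the one-byte hash shifted left by one.
        hash = (hash << 2).wrapping_add(gear_ls[data[a] as usize]);
        if (hash & mask_ls) == 0 {
            return Some(a);
        }
        // Second byte: the value is now the one-byte hash after two steps.
        hash = hash.wrapping_add(gear[data[a + 1] as usize]);
        if (hash & mask) == 0 {
            return Some(a + 1);
        }
    }
    None
}
```

<p>Calling both functions on the same even-length slice with the same table and mask should return the same offset. The bit-63 assumption matters because the first-byte check sees the hash only after it has been shifted left, so the top bit of an unshifted mask could never be tested there.</p>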

<p>Both versions produce the same chunk boundaries for the same input and parameters. The difference is purely mechanical: the 2020 version reaches those boundaries faster by doing two bytes of work per loop iteration. With the algorithm itself understood, the next question is practical: how do the parameters you choose actually affect the chunks that come out?</p>

<h3 id="exploring-the-parameters">Exploring the Parameters</h3>

<p>The target average chunk size is the primary parameter when configuring FastCDC. A smaller average means more chunks (better deduplication granularity but more metadata), while a larger average means fewer chunks (less overhead but coarser deduplication). Drag the slider below to see how FastCDC re-chunks the same text at different target sizes:</p>

<div class="cdc-viz" id="parametric-demo">
  <div class="cdc-viz-header">
    <div class="cdc-viz-title">Parametric Chunking Explorer</div>
    <p class="cdc-viz-hint">See how target average size affects chunk boundaries and size distribution.</p>
  </div>
  <!-- Slider control -->
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Target Average: <strong id="parametric-slider-value">88</strong> bytes
    </span>
    <input type="range" id="parametric-slider" min="48" max="128" value="88" step="2" />
    <span class="parametric-derived-params" id="parametric-derived-params">(min: 44, max: 264)</span>
  </div>

  <!-- Text with chunk highlighting -->
  <div id="parametric-text-display" class="cdc-content cdc-text-view"></div>

  <!-- Chunk summary section -->
  <div class="parametric-section-header">
    <div class="parametric-section-title">Chunk Summary</div>
    <div class="parametric-summary">
      <span id="parametric-stat-count">--</span> chunks · target <span id="parametric-stat-target">--</span> · avg <span id="parametric-stat-actual">--</span> · min <span id="parametric-stat-min">--</span> · max <span id="parametric-stat-max">--</span>
    </div>
  </div>
  <div class="comparison-blocks-hint">Each bar is one chunk. Height and width show relative size (dashed line = target).</div>
  <div id="parametric-blocks-bar" class="cdc-blocks-view"></div>
</div>

<p>Notice how the dual-mask strategy keeps chunk sizes clustered around the target average. At small targets (48-64 bytes) you get many chunks, and the distribution chart reveals the normalized shape. At large targets (96-128 bytes) the entire text may collapse into just a few chunks.</p>
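
<p>The tradeoff between granularity and metadata can be made concrete with some rough arithmetic. The sketch below assumes the bounds shown next to the slider (minimum = target / 2, maximum = target × 3, inferred from the "min: 44, max: 264" readout at a target of 88) and a hypothetical 40 bytes of index metadata per chunk. Neither figure comes from the FastCDC papers, and <code>derive_bounds</code>, <code>expected_chunks</code>, and <code>index_overhead_bytes</code> are illustrative names.</p>

```rust
/// Derive (min, max) chunk sizes from the target average.
/// Assumption: min = avg / 2 and max = avg * 3, matching the demo readout.
fn derive_bounds(avg_size: usize) -> (usize, usize) {
    (avg_size / 2, avg_size * 3)
}

/// Rough chunk count if chunks land on the target average (rounded up).
fn expected_chunks(total_bytes: u64, avg_size: u64) -> u64 {
    (total_bytes + avg_size - 1) / avg_size
}

/// Index size for a given (hypothetical) per-chunk metadata cost in bytes.
fn index_overhead_bytes(total_bytes: u64, avg_size: u64, metadata_per_chunk: u64) -> u64 {
    expected_chunks(total_bytes, avg_size) * metadata_per_chunk
}
```

<p>Under these assumptions, 1 GiB at an 8 KiB target comes to 131,072 chunks and 5 MiB of index, and halving the target to 4 KiB doubles both.</p>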

<p>To see why normalization matters, consider what a single mask does. Basic CDC checks the same bit pattern from minimum size all the way to maximum size. Each byte after the minimum has an independent 1/avgSize probability of triggering a boundary. Because the algorithm cuts at the <em>first</em> match, most chunks end early: the chance of reaching any given byte without a match drops exponentially. This produces a geometric distribution skewed toward small chunks, with a few chunks surviving long enough to reach the maximum size. The FastCDC paper addresses this with normalization levels (NC1 through NC3), which control how aggressively the two masks differ from the base probability. At NC2 (the paper’s recommended level), the strict mask is 4x harder to trigger than the single-mask baseline, suppressing early cuts below the target average, while the loose mask is 4x easier, catching chunks shortly after they pass it. The result is a tight cluster around the target rather than a skewed spread.</p>
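
<p>The mask arithmetic behind normalization is easy to check by hand. The sketch below assumes that a mask with <em>n</em> effective bits matches with probability 2<sup>-n</sup> per byte, and that normalization level <em>k</em> adds <em>k</em> bits to the strict mask and removes <em>k</em> bits from the loose mask. <code>normalized_bits</code>, <code>cut_probability</code>, and <code>survival</code> are hypothetical helper names, not part of any FastCDC library.</p>

```rust
/// Effective mask bits (strict, loose) for a target average and normalization level.
fn normalized_bits(avg_size: usize, level: u32) -> (u32, u32) {
    // floor(log2(avg_size)); avg_size must be at least 2^level.
    let bits = usize::BITS - 1 - avg_size.leading_zeros();
    (bits + level, bits - level)
}

/// Per-byte probability that a mask with `bits` effective bits matches.
fn cut_probability(bits: u32) -> f64 {
    1.0 / (1u64 << bits) as f64
}

/// Probability that no boundary fires across `bytes` independent checks.
fn survival(per_byte: f64, bytes: i32) -> f64 {
    (1.0 - per_byte).powi(bytes)
}
```

<p>For an 8 KiB target at level 2, <code>normalized_bits(8192, 2)</code> gives 15 and 11 bits, which is the 4x harder and 4x easier pair described above. Over 4 KiB of checks, the 15-bit strict mask leaves roughly 88% of chunks uncut, compared with roughly 61% for a single 13-bit mask.</p>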

<p>Compare the two approaches below. Both chunk the same 8 KB of pseudo-random bytes (generated from a fixed seed), using the same Gear hash and the same target parameters. The only difference is how the mask is applied. Random data is used because it makes the statistical properties of each algorithm clearly visible; natural language text has too much structure to reveal the distribution shapes at this scale.</p>

<p>The density curve beneath each chunked block view shows the distribution of chunk sizes: the horizontal axis is chunk size in bytes, the vertical axis is how likely a chunk of that size is, and the dashed line marks the target average. A tall, narrow peak means most chunks land near the same size; a long tail trailing to the right means some chunks end up much larger than the target:</p>

<div class="cdc-viz" id="comparison-demo">
  <div class="cdc-viz-header">
    <div class="cdc-viz-title">Basic vs Normalized Chunk Size Distribution</div>
    <p class="cdc-viz-hint">Compare how single-mask and dual-mask strategies distribute chunk sizes across the same data.</p>
  </div>
  <!-- Shared slider -->
  <div class="parametric-control-row">
    <span class="parametric-control-label">
      Target Average: <strong id="comparison-slider-value">88</strong> bytes
    </span>
    <input type="range" id="comparison-slider" min="48" max="128" value="88" step="2" />
    <span class="parametric-derived-params" id="comparison-derived-params">(min: 44, max: 264)</span>
  </div>

  <!-- Two-column comparison -->
  <div class="comparison-columns">
    <!-- Left: Basic CDC -->
    <div class="comparison-col">
      <div class="comparison-label"><span class="comparison-label-text">Basic CDC <span class="comparison-sublabel">(Single Mask)</span></span>
        <span class="comparison-summary" id="comparison-basic-stats">
          <span id="comparison-basic-stat-count">--</span> chunks · avg <span id="comparison-basic-stat-actual">--</span> · min <span id="comparison-basic-stat-min">--</span> · max <span id="comparison-basic-stat-max">--</span>
        </span>
      </div>
      <div class="comparison-blocks-hint">Each bar is one chunk. Height and width show relative size (dashed line = target).</div>
      <div id="comparison-basic-blocks" class="cdc-blocks-view"></div>
      <div class="comparison-blocks-hint">Density curve: higher peaks mean more chunks of that size. Dashed line marks the target average.</div>
      <div id="comparison-basic-distribution" class="parametric-distribution-chart"></div>
    </div>

    <!-- Right: Normalized CDC -->
    <div class="comparison-col">
      <div class="comparison-label"><span class="comparison-label-text">Normalized CDC <span class="comparison-sublabel">(Dual Mask)</span></span>
        <span class="comparison-summary" id="comparison-normalized-stats">
          <span id="comparison-normalized-stat-count">--</span> chunks · avg <span id="comparison-normalized-stat-actual">--</span> · min <span id="comparison-normalized-stat-min">--</span> · max <span id="comparison-normalized-stat-max">--</span>
        </span>
      </div>
      <div class="comparison-blocks-hint">Each bar is one chunk. Height and width show relative size (dashed line = target).</div>
      <div id="comparison-normalized-blocks" class="cdc-blocks-view"></div>
      <div class="comparison-blocks-hint">Density curve: higher peaks mean more chunks of that size. Dashed line marks the target average.</div>
      <div id="comparison-normalized-distribution" class="parametric-distribution-chart"></div>
    </div>
  </div>
</div>

<p>Basic CDC’s single mask produces chunks that follow a geometric distribution (the discrete counterpart of an exponential): many small chunks and a long tail of large ones. FastCDC’s dual-mask normalization clusters chunks tightly around the target average, reducing both extremes. This narrower distribution means less wasted metadata on tiny chunks and fewer oversized chunks that dilute deduplication.</p>
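
<p>The same comparison can be reproduced offline. The following is a minimal simulation sketch, not the code behind the demo. It assumes a table filled by a seeded SplitMix64 generator instead of the real Gear table, masks that test the top bits of the hash rather than the paper's padded masks, level-2 normalization, and the min = target / 2, max = target × 3 bounds shown by the slider. All function names are illustrative.</p>

```rust
/// SplitMix64 step: seeded generator for the demo table and the input bytes.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Mask that tests the top `bits` bits of the hash (1 <= bits <= 63).
fn top_mask(bits: u32) -> u64 {
    u64::MAX << (64 - bits)
}

/// Chunk `data`, applying `strict` below `avg` and `loose` from `avg` onward.
/// Passing the same mask twice gives single-mask (basic) CDC.
fn chunk_sizes(data: &[u8], gear: &[u64; 256], avg: usize, strict: u64, loose: u64) -> Vec<usize> {
    let (min, max) = (avg / 2, avg * 3);
    let mut sizes = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let window = &data[start..data.len().min(start + max)];
        let mut hash: u64 = 0;
        let mut cut = window.len(); // forced cut at max (or at end of input)
        for (i, &byte) in window.iter().enumerate() {
            hash = (hash << 1).wrapping_add(gear[byte as usize]);
            if i < min {
                continue; // no boundary checks below the minimum size
            }
            let mask = if i < avg { strict } else { loose };
            if (hash & mask) == 0 {
                cut = i + 1;
                break;
            }
        }
        sizes.push(cut);
        start += cut;
    }
    sizes
}

/// Mean and standard deviation of a list of chunk sizes.
fn mean_and_std(sizes: &[usize]) -> (f64, f64) {
    let n = sizes.len() as f64;
    let mean = sizes.iter().sum::<usize>() as f64 / n;
    let var = sizes.iter().map(|&s| (s as f64 - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

/// Run both strategies on the same pseudo-random input; returns (basic, normalized).
fn compare(seed: u64, len: usize, avg: usize) -> ((f64, f64), (f64, f64)) {
    let mut state = seed;
    let mut gear = [0u64; 256];
    for entry in gear.iter_mut() {
        *entry = splitmix64(&mut state);
    }
    let data: Vec<u8> = (0..len).map(|_| splitmix64(&mut state) as u8).collect();
    let bits = usize::BITS - 1 - avg.leading_zeros(); // floor(log2(avg))
    let basic = chunk_sizes(&data, &gear, avg, top_mask(bits), top_mask(bits));
    let normalized = chunk_sizes(&data, &gear, avg, top_mask(bits + 2), top_mask(bits - 2));
    (mean_and_std(&basic), mean_and_std(&normalized))
}
```

<p>A call such as <code>compare(7, 1 &lt;&lt; 20, 256)</code> returns a (mean, standard deviation) pair for each strategy. If the normalization behaves as described, the normalized standard deviation should come out well below the basic one.</p>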

<p>FastCDC gives us a chunking algorithm that is fast, produces well-distributed chunk sizes, and, most importantly, generates stable boundaries that survive local edits. But chunking is only the first stage of a deduplication pipeline. Once data is split into chunks, each chunk needs a cryptographic fingerprint, those fingerprints need to be indexed and looked up efficiently, and duplicate chunks need to be eliminated during storage or transmission. The choices made at each stage (hash function, index structure, storage layout) interact with the chunking layer in ways that matter for real-world performance.</p>

<p>In the next post, <a href="/writings/content-defined-chunking-part-3">Part 3: Deduplication in Action</a>, we’ll build on FastCDC to walk through the deduplication pipeline end to end: fingerprinting chunks, detecting duplicates, and examining the cost tradeoffs that shape how these systems perform in practice.</p>

<hr />

<h3 id="references">References</h3>

<div class="cdc-references">

<div class="bib-entry" id="ref-5">
  <div class="bib-number">[5]</div>
  <div class="bib-citation">W. Xia, Y. Zhou, H. Jiang, D. Feng, Y. Hua, Y. Hu, Y. Zhang &amp; Q. Liu, "FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication," <em>Proceedings of the USENIX Annual Technical Conference (ATC)</em>, 2016.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/fastcdc-2016-xia.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

<div class="bib-entry" id="ref-6">
  <div class="bib-number">[6]</div>
  <div class="bib-citation">W. Xia, X. Zou, H. Jiang, Y. Zhou, C. Liu, D. Feng, Y. Hua, Y. Hu &amp; Y. Zhang, "The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems," <em>IEEE Transactions on Parallel and Distributed Systems</em>, vol. 31, no. 9, pp. 2017-2031, 2020.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/fastcdc-2020-xia.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://ieeexplore.ieee.org/document/9055082" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE Xplore</a>
  </div>
</div>

</div>

<hr />

<div class="cdc-series-nav">
&larr; <a href="/writings/content-defined-chunking-part-1">Part 1: From Problem to Taxonomy</a> · Continue reading &rarr; <a href="/writings/content-defined-chunking-part-3">Part 3: Deduplication in Action</a>
</div>

<script type="module" src="/assets/js/cdc-animations.js"></script>

]]>
      </content:encoded>
    </item>
    
    <item>
      <title>From Problem to Taxonomy</title>
      <link>https://rickwinfrey.com/writings/content-defined-chunking-part-1</link>
      <guid isPermaLink="true">https://rickwinfrey.com/writings/content-defined-chunking-part-1</guid>
      <pubDate>Mon, 02 Feb 2026 12:00:00 +0000</pubDate>
      
      <description>An introduction to content-defined chunking: why fixed-size splitting fails, how content-aware boundaries solve the deduplication problem, and a taxonomy of three CDC algorithm families.</description>
      
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <![CDATA[<style>
/* ==========================================================================
   CDC Animation Styles
   ========================================================================== */
/* View mode tabs (Text / Blocks / Hex) */
.cdc-view-tabs {
  display: flex;
  gap: 0.25rem;
  background: rgba(61, 58, 54, 0.05);
  padding: 0.25rem;
  border-radius: 6px;
}

.cdc-view-tab {
  padding: 0.4rem 0.75rem;
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 0.8rem;
  color: #8b7355;
  background: transparent;
  border: none;
  border-radius: 4px;
  cursor: pointer;
  transition: all 0.15s ease;
}

.cdc-view-tab:hover {
  color: #3d3a36;
}

.cdc-view-tab.active {
  background: #fff;
  color: #c45a3b;
  box-shadow: 0 1px 3px rgba(0,0,0,0.1);
}

/* Content display area */
.cdc-content {
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 1rem;
  line-height: 1.8;
  color: #3d3a36;
}

/* Text view with chunk highlighting */
.cdc-text-view {
  white-space: pre-wrap;
  word-break: break-word;
}

.cdc-text-view .chunk {
  display: inline;
  padding: 0.1rem 0;
  border-radius: 2px;
  transition: background-color 0.2s ease;
}

/* Chunk colors - warm palette matching site */
.cdc-text-view .chunk-0 { background-color: rgba(196, 90, 59, 0.15); }
.cdc-text-view .chunk-1 { background-color: rgba(212, 165, 116, 0.25); }
.cdc-text-view .chunk-2 { background-color: rgba(139, 115, 85, 0.15); }
.cdc-text-view .chunk-3 { background-color: rgba(196, 90, 59, 0.25); }
.cdc-text-view .chunk-4 { background-color: rgba(212, 165, 116, 0.15); }
.cdc-text-view .chunk-5 { background-color: rgba(139, 115, 85, 0.25); }

/* Block view */
.cdc-blocks-view {
  display: flex;
  align-items: stretch;
  gap: 2px;
  margin-top: 1rem;
  padding: 0.5rem 0;
  width: 100%;
}

.cdc-block {
  height: 24px;
  border-radius: 3px;
  transition: all 0.2s ease;
  position: relative;
}

.cdc-block.chunk-0 { background-color: #c45a3b; }
.cdc-block.chunk-1 { background-color: #d4a574; }
.cdc-block.chunk-2 { background-color: #8b7355; }
.cdc-block.chunk-3 { background-color: #c45a3b; opacity: 0.7; }
.cdc-block.chunk-4 { background-color: #d4a574; opacity: 0.7; }
.cdc-block.chunk-5 { background-color: #8b7355; opacity: 0.7; }

.cdc-block:hover {
  transform: scaleY(1.2);
  z-index: 1;
}

/* Hex view */
.cdc-hex-view {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  line-height: 1.6;
  display: flex;
  flex-wrap: wrap;
  gap: 0.5rem 1rem;
}

.cdc-hex-byte {
  padding: 0.15rem 0.3rem;
  border-radius: 2px;
}

/* File icon visualization for fixed vs CDC comparison */
.cdc-file-icon {
  position: relative;
  background: #fff;
  border: 1px solid #d0d0d0;
  border-radius: 3px;
  padding: 1.5rem;
  padding-top: 2.25rem;
  margin: 0.75rem 0;
  box-shadow: 0 2px 8px rgba(0, 0, 0, 0.08);
}

/* Folded corner effect */
.cdc-file-corner {
  position: absolute;
  top: 0;
  right: 0;
  width: 0;
  height: 0;
  border-style: solid;
  border-width: 0 24px 24px 0;
  border-color: transparent #faf9f7 transparent transparent;
  filter: drop-shadow(-1px 1px 1px rgba(0, 0, 0, 0.1));
}

.cdc-file-corner::before {
  content: '';
  position: absolute;
  top: 0;
  right: -24px;
  width: 0;
  height: 0;
  border-style: solid;
  border-width: 0 0 24px 24px;
  border-color: transparent transparent #e8e8e8 transparent;
}

.cdc-file-label {
  position: absolute;
  top: 0.6rem;
  left: 1rem;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  font-weight: 600;
  color: #8b7355;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

.cdc-file-content {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  line-height: 2.2;
  color: #3d3a36;
  white-space: pre-wrap;
  word-break: break-word;
}

.cdc-chunk-explanation {
  font-size: 0.8rem;
  color: #8b7355;
  margin: 0.25rem 0 0.5rem 0;
  font-style: italic;
}

/* Chunk spans with box styling - matches CHUNK_SOLID_COLORS from cdc-animations.js */
.cdc-chunk {
  padding: 0.2rem 0.35rem;
  border-radius: 3px;
  border: 2px solid;
  display: inline;
  box-decoration-break: clone;
  -webkit-box-decoration-break: clone;
}

.cdc-chunk.chunk-a {
  background: rgba(196, 90, 59, 0.15);
  border-color: #c45a3b;
}

.cdc-chunk.chunk-b {
  background: rgba(90, 138, 90, 0.15);
  border-color: #5a8a5a;
}

.cdc-chunk.chunk-c {
  background: rgba(70, 110, 160, 0.15);
  border-color: #466ea0;
}

.cdc-chunk.chunk-d {
  background: rgba(160, 100, 50, 0.15);
  border-color: #a06432;
}

.cdc-chunk.chunk-e {
  background: rgba(130, 80, 150, 0.15);
  border-color: #825096;
}

/* New chunk - terracotta accent to match interactive demos */
.cdc-chunk.chunk-new {
  background: rgba(196, 90, 59, 0.2);
  border-color: #c45a3b;
  border-style: solid;
}

/* Unchanged chunk - muted gray, matches shared/dedup style in animations */
.cdc-chunk.unchanged {
  background: rgba(61, 58, 54, 0.06);
  border-color: rgba(61, 58, 54, 0.2);
  color: #8b8178;
}

/* Changed chunk - dashed border to signal the chunk content shifted */
.cdc-chunk.changed {
  border-style: dashed;
}

/* Chunk Comparison Demo (JS-powered before/after) */
.cdc-chunk-comparison-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  font-weight: 600;
  color: #8b7355;
  text-transform: uppercase;
  letter-spacing: 0.05em;
  margin-bottom: 0.4rem;
}

.cdc-chunk-comparison-file {
  margin-bottom: 0.75rem;
}

.cdc-chunk-comparison-text {
  white-space: pre-wrap;
  word-break: break-word;
  padding: 0.75rem;
  background: rgba(61, 58, 54, 0.02);
  border-radius: 6px;
  border: 1px solid rgba(61, 58, 54, 0.06);
  margin-bottom: 0.5rem;
  font-size: 0.85rem;
  line-height: 1.6;
}

.cdc-cmp-chunk {
  padding: 0.15rem 0.25rem;
  border-radius: 3px;
  border: 2px solid;
  display: inline-block;
  cursor: default;
  transition: filter 0.1s ease;
}

.cdc-cmp-chunk.unchanged {
  background: rgba(61, 58, 54, 0.06);
  border-color: rgba(61, 58, 54, 0.2);
  color: #8b8178;
}

.cdc-cmp-chunk.new {
  border-style: solid;
}

.cdc-cmp-chunk.chunk-hover {
  filter: brightness(0.82);
  outline: 3px solid rgba(61, 58, 54, 0.5);
  outline-offset: 0px;
  box-shadow: 0 0 8px rgba(0, 0, 0, 0.15);
}

.cdc-cmp-chunk.unchanged.chunk-hover {
  filter: brightness(0.85);
  outline: 3px solid rgba(61, 58, 54, 0.4);
  background: rgba(61, 58, 54, 0.15);
}

/* Chunk wrapper: label above, text below */
.cdc-cmp-chunk-wrapper {
  display: inline-flex;
  flex-direction: column;
  align-items: center;
  vertical-align: top;
  margin: 0.15rem 0.2rem;
}

.cdc-chunk-summary {
  display: grid;
  grid-template-columns: repeat(4, 1fr);
  gap: 0.75rem;
  padding: 0.75rem;
  margin: 0.5rem 0;
  background: rgba(61, 58, 54, 0.03);
  border-radius: 6px;
  border: 1px solid rgba(61, 58, 54, 0.06);
}

.cdc-chunk-summary-stat {
  text-align: center;
}

.cdc-chunk-summary-value {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 1.1rem;
  font-weight: 600;
  line-height: 1.2;
}

.cdc-chunk-summary-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.65rem;
  color: #8b7355;
  margin-top: 0.2rem;
  text-transform: uppercase;
  letter-spacing: 0.04em;
}

.cdc-cmp-chunk-label {
  display: block;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  font-weight: 600;
  text-align: center;
  letter-spacing: 0.02em;
  margin-bottom: 0.15rem;
}

/* Edit indicator arrow */
.cdc-edit-indicator {
  text-align: center;
  font-size: 0.8rem;
  color: #8b7355;
  padding: 0.5rem 0;
}

/* Deduplication result */
.cdc-dedup-result {
  text-align: center;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  font-weight: 600;
  padding: 0.75rem;
  border-radius: 6px;
  margin-top: 0.75rem;
}

.cdc-dedup-result.bad {
  background: rgba(196, 90, 59, 0.1);
  color: #a84832;
}

.cdc-dedup-result.good {
  background: rgba(90, 160, 90, 0.1);
  color: #3d8b3d;
}

/* Rolling window indicator */
.cdc-window {
  position: absolute;
  height: 100%;
  background: rgba(196, 90, 59, 0.3);
  border: 2px solid #c45a3b;
  border-radius: 4px;
  pointer-events: none;
  transition: left 0.1s ease;
}

/* Hash display */
.cdc-hash-display {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.8rem;
  color: #8b7355;
  min-height: 1.4em;
}

.cdc-hash-display strong {
  color: #c45a3b;
}

/* Controls panel */
.cdc-controls {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(140px, 1fr));
  gap: 1.25rem;
  padding: 1.25rem;
  background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-top: none;
  border-radius: 0 0 8px 8px;
}

.cdc-control-group {
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
}

.cdc-control-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  color: #3d3a36;
}

.cdc-controls input[type="range"] {
  width: 100%;
  height: 6px;
  -webkit-appearance: none;
  appearance: none;
  background: linear-gradient(to right, #d4a574, #c45a3b);
  border-radius: 3px;
  outline: none;
}

.cdc-controls input[type="range"]::-webkit-slider-thumb {
  -webkit-appearance: none;
  appearance: none;
  width: 16px;
  height: 16px;
  background: #c45a3b;
  border-radius: 50%;
  cursor: pointer;
  box-shadow: 0 2px 4px rgba(0,0,0,0.2);
  transition: transform 0.15s ease;
}

.cdc-controls input[type="range"]::-webkit-slider-thumb:hover {
  transform: scale(1.1);
}

.cdc-controls input[type="range"]::-moz-range-thumb {
  width: 16px;
  height: 16px;
  background: #c45a3b;
  border-radius: 50%;
  cursor: pointer;
  border: none;
  box-shadow: 0 2px 4px rgba(0,0,0,0.2);
}

/* Playback controls */
.cdc-playback {
  display: flex;
  align-items: center;
  gap: 0.75rem;
  padding: 1rem 1.25rem;
  background: rgba(61, 58, 54, 0.02);
  border-top: 1px solid rgba(61, 58, 54, 0.08);
}

.cdc-playback-btn {
  width: 36px;
  height: 36px;
  border-radius: 50%;
  border: none;
  background: #c45a3b;
  color: #fff;
  cursor: pointer;
  display: flex;
  align-items: center;
  justify-content: center;
  transition: all 0.15s ease;
}

.cdc-playback-btn:hover {
  background: #a84832;
  transform: scale(1.05);
}

.cdc-playback-btn.secondary {
  background: rgba(61, 58, 54, 0.1);
  color: #3d3a36;
}

.cdc-playback-btn.secondary:hover {
  background: rgba(61, 58, 54, 0.2);
}

.cdc-speed-control {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  margin-left: auto;
}

.cdc-speed-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.8rem;
  color: #8b7355;
}

/* Progress indicator */
.cdc-progress {
  flex: 1;
  height: 4px;
  background: rgba(61, 58, 54, 0.1);
  border-radius: 2px;
  overflow: hidden;
  margin: 0 0.5rem;
}

.cdc-progress-bar {
  height: 100%;
  background: linear-gradient(to right, #d4a574, #c45a3b);
  border-radius: 2px;
  transition: width 0.1s ease;
}

/* Side-by-side comparison */
.cdc-comparison {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1.5rem;
  margin: 2rem 0;
}

@media (max-width: 50em) {
  .cdc-comparison {
    grid-template-columns: 1fr;
  }
}

.cdc-comparison-panel {
  padding: 1.25rem;
  background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-radius: 8px;
}

.cdc-comparison-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 1rem;
  font-weight: 600;
  color: #3d3a36;
  margin-bottom: 1rem;
  padding-bottom: 0.75rem;
  border-bottom: 1px solid rgba(61, 58, 54, 0.1);
}

/* Chunk boundary marker */
.cdc-boundary-marker {
  display: inline-block;
  width: 2px;
  height: 1.2em;
  background: #c45a3b;
  margin: 0 1px;
  vertical-align: middle;
  border-radius: 1px;
}

/* Stats display */
.cdc-stats {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(120px, 1fr));
  gap: 1rem;
  padding: 1rem;
  background: rgba(61, 58, 54, 0.02);
  border-radius: 6px;
  margin-top: 1rem;
}

.cdc-stat {
  text-align: center;
}

.cdc-stat-value {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 1.5rem;
  font-weight: 600;
  color: #c45a3b;
}

.cdc-stat-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.75rem;
  color: #8b7355;
  margin-top: 0.25rem;
}

/* Deduplication visualization */
.cdc-dedup-viz {
  margin: 2rem 0;
}

.cdc-dedup-files {
  display: grid;
  grid-template-columns: 1fr auto 1fr;
  gap: 1rem;
  align-items: start;
}

.cdc-dedup-arrow {
  display: flex;
  align-items: center;
  justify-content: center;
  padding: 2rem 0;
  color: #8b7355;
  font-size: 1.5rem;
}

.cdc-dedup-storage {
  margin-top: 1.5rem;
  padding: 1.25rem;
  background: linear-gradient(135deg, rgba(196, 90, 59, 0.05) 0%, rgba(212, 165, 116, 0.08) 100%);
  border-radius: 8px;
}

.cdc-dedup-storage-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.9rem;
  font-weight: 600;
  color: #3d3a36;
  margin-bottom: 0.75rem;
}

.cdc-dedup-chunks {
  display: flex;
  flex-wrap: wrap;
  gap: 0.5rem;
}

.cdc-dedup-chunk {
  padding: 0.4rem 0.75rem;
  border-radius: 4px;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  color: #fff;
}

.cdc-dedup-chunk.shared {
  box-shadow: 0 0 0 2px #fff, 0 0 0 4px currentColor;
}

/* Versioned Dedup - Editor */
.cdc-dedup-editor { display: flex; flex-direction: column; gap: 0.75rem; margin-bottom: 1.5rem; }

.cdc-dedup-textarea {
  width: 100%; min-height: 80px; padding: 0.75rem;
  font-family: 'Source Serif 4', Georgia, serif; font-size: 0.9rem; line-height: 1.6;
  color: #3d3a36; background: #fff;
  border: 1px solid rgba(61, 58, 54, 0.2); border-radius: 6px;
  resize: vertical; box-sizing: border-box;
}
.cdc-dedup-textarea:focus { outline: none; border-color: #c45a3b; box-shadow: 0 0 0 2px rgba(196, 90, 59, 0.15); }

.cdc-dedup-save-btn {
  align-self: flex-start; padding: 0.5rem 1.25rem;
  font-family: 'Libre Baskerville', Georgia, serif; font-size: 0.85rem;
  color: #fff; background: #c45a3b; border: none; border-radius: 6px;
  cursor: pointer; transition: background 0.15s ease, transform 0.1s ease;
}
.cdc-dedup-save-btn:hover { background: #a84832; transform: translateY(-1px); }
.cdc-dedup-save-btn:active { transform: translateY(0); }

/* Versioned Dedup - Timeline */
.cdc-dedup-timeline { position: relative; margin-bottom: 1.5rem; }

.cdc-version-entry { display: flex; gap: 1rem; padding-bottom: 1.5rem; position: relative; }

.cdc-version-entry:not(:last-child)::before {
  content: ''; position: absolute; top: 15px; left: 5px;
  width: 2px; bottom: 0; background: rgba(61, 58, 54, 0.15);
}

.cdc-version-dot {
  position: relative; flex-shrink: 0;
  width: 12px; height: 12px; margin-top: 3px;
  background: #c45a3b; border-radius: 50%;
  border: 2px solid #fff; box-shadow: 0 0 0 1px rgba(61, 58, 54, 0.2);
}

.cdc-version-content { flex: 1; min-width: 0; }

.cdc-version-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.9rem; font-weight: 600; color: #3d3a36; margin-bottom: 0.5rem;
}

.cdc-version-cols { display: grid; grid-template-columns: 1fr 180px; gap: 1rem; align-items: start; }

.cdc-version-text {
  white-space: pre-wrap; word-break: break-word;
  padding: 0.5rem; background: rgba(61, 58, 54, 0.02);
  border-radius: 6px; border: 1px solid rgba(61, 58, 54, 0.06);
}

.cdc-version-blocks { display: flex; flex-direction: column; gap: 0.5rem; }
.cdc-version-blocks .cdc-blocks-view { margin-top: 0; min-height: 24px; }

.cdc-version-stats {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.7rem; color: #8b7355; line-height: 1.4;
}

.cdc-dedup-timeline-title {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem; font-weight: 600; color: #8b7355;
  text-transform: uppercase; letter-spacing: 0.06em;
  margin-bottom: 0.75rem;
}

[data-chunk-hash].hash-hover {
  filter: brightness(0.85);
  outline: 2px solid rgba(61, 58, 54, 0.4);
  outline-offset: -1px;
}

@media (max-width: 42em) {
  .cdc-version-cols { grid-template-columns: 1fr; }
}

/* Table of Contents */
.cdc-toc {
  margin: 2rem 0;
  padding: 1.25rem 1.5rem;
  background: rgba(61, 58, 54, 0.03);
  border: 1px solid rgba(61, 58, 54, 0.1);
  border-radius: 8px;
}

.cdc-toc strong {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.95rem;
  color: #3d3a36;
}

.cdc-toc ol {
  margin: 0.75rem 0 0 0;
  padding-left: 1.25rem;
}

.cdc-toc li {
  margin-bottom: 0.4rem;
  font-size: 0.9rem;
  line-height: 1.5;
  color: #5a564f;
}

.cdc-toc a {
  color: #c45a3b;
  text-decoration: none;
  font-weight: 600;
}

.cdc-toc a:hover {
  text-decoration: underline;
}

.cdc-toc ul {
  margin: 0.25rem 0 0.25rem 0;
  padding-left: 1.25rem;
  list-style: none;
}

.cdc-toc ul li {
  margin-bottom: 0.15rem;
  font-size: 0.82rem;
  color: #8b7355;
}

.cdc-toc ul li a {
  font-weight: 400;
  color: #8b7355;
}

.cdc-toc ul li a:hover {
  color: #c45a3b;
}

/* Taxonomy comparison table */
.cdc-taxonomy-table {
  margin: 1.5rem 0;
}

.cdc-taxonomy-table table {
  width: 100%;
  border: 1px solid rgba(61, 58, 54, 0.15);
  border-radius: 6px;
  overflow: hidden;
}

.cdc-taxonomy-table th,
.cdc-taxonomy-table td {
  padding: 0.6rem 0.75rem;
  font-size: 0.85rem;
  line-height: 1.5;
  border-left: 1px solid rgba(61, 58, 54, 0.1);
  width: auto;
  text-align: left;
  font-weight: 400;
  background: transparent;
}

.cdc-taxonomy-table td:last-child,
.cdc-taxonomy-table th:last-child {
  border-left: 1px solid rgba(61, 58, 54, 0.1);
  font-weight: 400;
  background: transparent;
}

.cdc-taxonomy-table th:first-child,
.cdc-taxonomy-table td:first-child {
  border-left: none;
}

.cdc-taxonomy-table th {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  font-weight: 600;
  text-align: center;
}

.cdc-taxonomy-table th.bsw {
  background: rgba(196, 90, 59, 0.1);
  color: #c45a3b;
}

.cdc-taxonomy-table th.extrema {
  background: rgba(42, 125, 79, 0.08);
  color: #2a7d4f;
}

.cdc-taxonomy-table th.statistical {
  background: rgba(139, 115, 85, 0.1);
  color: #8b7355;
}

.cdc-taxonomy-table .row-label {
  font-weight: 600;
  color: #3d3a36;
  white-space: nowrap;
}

.cdc-taxonomy-table .algo-year {
  color: #a89b8c;
  font-size: 0.78rem;
}

.cdc-taxonomy-table tr {
  border-bottom: 1px solid rgba(61, 58, 54, 0.08);
}

.cdc-taxonomy-table thead tr {
  background: transparent;
  border-bottom: 2px solid rgba(61, 58, 54, 0.15);
}

.cdc-taxonomy-table-note {
  margin-top: 0.5rem;
  font-size: 0.72rem;
  color: #a89b8c;
  text-align: center;
  line-height: 1.4;
}

.cdc-taxonomy-table-note a {
  color: #c45a3b;
  text-decoration: none;
  font-weight: 600;
}

.cdc-tax-family-label {
  padding: 0.3rem 0.5rem;
  border-radius: 5px;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.7rem;
  font-weight: 600;
  text-align: center;
  white-space: nowrap;
}

.cdc-tax-family-label.bsw {
  background: rgba(196, 90, 59, 0.12);
  color: #c45a3b;
  border: 1px solid rgba(196, 90, 59, 0.25);
}

.cdc-tax-family-label.extrema {
  background: rgba(42, 125, 79, 0.1);
  color: #2a7d4f;
  border: 1px solid rgba(42, 125, 79, 0.2);
}

.cdc-tax-family-label.statistical {
  background: rgba(139, 115, 85, 0.12);
  color: #8b7355;
  border: 1px solid rgba(139, 115, 85, 0.25);
}

.cdc-references .bib-note {
  font-size: 0.82rem;
  color: #8b7355;
  margin-top: 0.4rem;
  line-height: 1.5;
}

.cdc-learn-more {
  display: inline-flex;
  align-items: center;
  gap: 0.4rem;
  margin-top: 0.75rem;
  padding: 0.4rem 0.75rem;
  background: rgba(212, 165, 116, 0.15);
  border-radius: 4px;
  font-size: 0.8rem;
  font-style: normal;
  color: #8b7355;
}

.cdc-learn-more::before {
  content: "💡";
}

/* Combined text + hex view */
.cdc-combined-view {
  display: flex;
  flex-wrap: wrap;
  gap: 1px;
}

.cdc-byte-col {
  display: flex;
  flex-direction: column;
  align-items: center;
  border-radius: 2px;
  padding: 0.15rem 0.1rem;
}

.cdc-byte-char {
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 0.95rem;
  line-height: 1.4;
  color: #3d3a36;
}

.cdc-byte-hex {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  line-height: 1;
  color: #8b7355;
  margin-top: 1px;
}

/* Block annotation bar (below text/hex views) */
.cdc-block-wrapper {
  display: flex;
  flex-direction: column;
  align-items: center;
}

.cdc-block {
  width: 100%;
}

.cdc-block-annotation {
  width: 100%;
  position: relative;
  margin-top: 0.3rem;
}

.cdc-block-line {
  width: 100%;
  height: 0;
  border-top: 1.5px solid #8b7355;
  opacity: 0.5;
}

.cdc-block-tick {
  position: absolute;
  left: 50%;
  top: 0;
  transform: translateX(-50%);
  width: 1.5px;
  height: 8px;
  background: #8b7355;
  opacity: 0.5;
}

.cdc-block-label {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #8b7355;
  white-space: nowrap;
  user-select: none;
  line-height: 1;
  text-align: center;
  margin-top: 10px;
  overflow: hidden;
  text-overflow: ellipsis;
}

/* Chunk hover highlights */
.cdc-combined-view .cdc-byte-col.chunk-hover {
  filter: brightness(0.85);
  outline: 1px solid rgba(61, 58, 54, 0.25);
  outline-offset: -1px;
}

.cdc-block-wrapper.chunk-hover .cdc-block {
  filter: brightness(1.15);
  box-shadow: 0 0 6px rgba(0, 0, 0, 0.2);
}

.cdc-block-wrapper.chunk-hover .cdc-block-label {
  color: #3d3a36;
  font-weight: 600;
}

.cdc-text-view .chunk.chunk-hover {
  filter: brightness(0.9);
  outline: 1px solid rgba(61, 58, 54, 0.3);
  outline-offset: -1px;
}

/* Gear Lookup Table grid */
.gear-table-grid {
  display: grid;
  grid-template-columns: repeat(16, 1fr);
  gap: 1px;
  margin-top: 0.5rem;
}

.gear-table-cell {
  height: 15px;
  border-radius: 1px;
  cursor: pointer;
  transition: transform 0.1s ease, box-shadow 0.15s ease;
  position: relative;
}

.gear-table-cell:hover {
  transform: scale(1.4);
  z-index: 2;
  box-shadow: 0 0 4px rgba(0,0,0,0.2);
}

.gear-table-cell.active {
  outline: 2px solid #c45a3b;
  outline-offset: 0px;
  box-shadow: 0 0 6px rgba(196, 90, 59, 0.5);
  z-index: 3;
  transform: scale(1.4);
}

.gear-table-readout {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.8rem;
  color: #8b7355;
  min-height: 1.4em;
}

.gear-table-readout strong {
  color: #c45a3b;
}

/* Rolling hash window strip */
.gear-hash-window {
  display: flex;
  gap: 1px;
  overflow-x: auto;
  padding: 0.25rem 0;
  margin-bottom: 0.5rem;
  min-height: 3.2rem;
}

.gear-hw-cell {
  display: flex;
  flex-direction: column;
  align-items: center;
  min-width: 2rem;
  padding: 0.2rem 0.15rem;
  border-radius: 3px;
  background: rgba(61, 58, 54, 0.04);
  transition: background-color 0.15s ease;
}

.gear-hw-cell.current {
  outline: 2px solid #c45a3b;
  outline-offset: -1px;
}

.gear-hw-cell.boundary {
  border-right: 2px solid #2a7d4f;
}

.gear-hw-char {
  font-family: 'Source Serif 4', Georgia, serif;
  font-size: 0.8rem;
  color: #3d3a36;
  line-height: 1.2;
}

.gear-hw-hash {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.5rem;
  color: #8b7355;
  line-height: 1;
  margin-top: 2px;
}

.gear-hw-hash.boundary {
  color: #2a7d4f;
  font-weight: 700;
}

/* Bit-shift visualization */
.gear-shift-viz {
  margin-bottom: 0.5rem;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
}

.gear-shift-row {
  display: flex;
  align-items: center;
  gap: 0.3rem;
  margin-bottom: 3px;
}

.gear-shift-label {
  width: 4rem;
  text-align: right;
  color: #8b7355;
  font-size: 0.6rem;
  flex-shrink: 0;
}

.gear-shift-hex {
  width: 5.5rem;
  text-align: right;
  color: #3d3a36;
  font-size: 0.6rem;
  flex-shrink: 0;
  padding-right: 0.3rem;
}

.gear-shift-bits {
  display: flex;
  gap: 0;
  position: relative;
}

.gear-bit {
  width: 7px;
  height: 14px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.45rem;
  line-height: 1;
  border-radius: 1px;
}

.gear-bit.b0,
.gear-bit.b1 {
  background: rgba(61, 58, 54, 0.06);
  color: #3d3a36;
}

.gear-bit.dropped {
  background: rgba(196, 90, 59, 0.25);
  color: #c45a3b;
  text-decoration: line-through;
}

.gear-bit.entering {
  background: rgba(90, 138, 90, 0.25);
  color: #5a8a5a;
  font-weight: 700;
}

@keyframes gear-slide-left {
  0% { transform: translateX(7px); opacity: 0.5; }
  100% { transform: translateX(0); opacity: 1; }
}

.gear-shift-bits.animated .gear-bit {
  animation: gear-slide-left 0.25s ease-out;
}

.gear-shift-box {
  border: 1.5px solid rgba(196, 90, 59, 0.3);
  border-radius: 6px;
  padding: 0.4rem 0.5rem;
  background: rgba(196, 90, 59, 0.02);
}

.gear-shift-connector {
  text-align: center;
  color: #8b7355;
  font-size: 0.7rem;
  line-height: 1;
  padding: 0.15rem 0;
}

.gear-shift-add {
  border: 1.5px solid rgba(61, 58, 54, 0.12);
  border-radius: 6px;
  padding: 0.4rem 0.5rem;
  background: rgba(61, 58, 54, 0.02);
}

.gear-shift-separator {
  width: calc(32 * 7px);
  border-top: 1px solid rgba(61, 58, 54, 0.15);
  margin: 2px 0;
}

/* Two-column layout: Operation panel + Gear table */
.gear-two-col {
  display: flex;
  gap: 1.5rem;
  margin-top: 1rem;
  align-items: flex-start;
}

.gear-col-left {
  flex: 1 1 0;
  min-width: 0;
}

.gear-col-right {
  flex: 1 1 0;
  min-width: 0;
}

/* Chunk boundary marker (vertical separator) */
.chunk-boundary-marker {
  display: inline-block;
  width: 2px;
  height: 1.2em;
  background: #c45a3b;
  margin: 0 2px;
  vertical-align: middle;
  border-radius: 1px;
  opacity: 0.6;
}

.chunk-label {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #8b7355;
  background: rgba(61, 58, 54, 0.06);
  padding: 0.1rem 0.3rem;
  border-radius: 2px;
  margin-right: 2px;
  vertical-align: top;
  line-height: 1;
  user-select: none;
}

/* Parametric Chunking Explorer - distribution chart */
.parametric-distribution-chart {
  position: relative;
  display: flex;
  align-items: flex-end;
  gap: 2px;
  height: 120px;
  padding: 0.5rem 0;
  margin-bottom: 1rem;
}

.parametric-dist-bar {
  flex: 1;
  min-width: 3px;
  border-radius: 2px 2px 0 0;
  transition: height 0.2s ease;
  cursor: pointer;
  position: relative;
}

.parametric-dist-bar:hover { opacity: 0.8; }

.parametric-dist-tooltip {
  display: none;
  position: absolute;
  bottom: 100%;
  left: 50%;
  transform: translateX(-50%);
  background: #3d3a36;
  color: #fff;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.7rem;
  padding: 0.25rem 0.5rem;
  border-radius: 4px;
  white-space: nowrap;
  pointer-events: none;
  margin-bottom: 4px;
  z-index: 10;
}

.parametric-dist-bar:hover .parametric-dist-tooltip { display: block; }

.parametric-dist-reference {
  position: absolute;
  left: 0;
  right: 0;
  border-top: 2px dashed rgba(196, 90, 59, 0.5);
  pointer-events: none;
  z-index: 1;
}

.parametric-dist-reference-label {
  position: absolute;
  right: 0;
  top: -1.1rem;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.6rem;
  color: #c45a3b;
  white-space: nowrap;
}

.parametric-dist-bar.chunk-hover {
  filter: brightness(1.15);
  box-shadow: 0 0 6px rgba(0, 0, 0, 0.2);
}

.parametric-derived-params {
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', 'Fira Mono', monospace;
  font-size: 0.75rem;
  color: #8b7355;
  white-space: nowrap;
}

/* Mobile responsive */
@media (max-width: 42em) {
  .cdc-controls {
    grid-template-columns: 1fr;
  }

  .cdc-dedup-files {
    grid-template-columns: 1fr;
  }

  .cdc-dedup-arrow {
    transform: rotate(90deg);
    padding: 1rem 0;
  }

  .cdc-chunk-summary {
    grid-template-columns: repeat(2, 1fr);
  }

  .cdc-hex-view {
    font-size: 0.65rem;
  }

  .gear-two-col {
    flex-direction: column;
  }

  .gear-table-grid {
    gap: 0px;
  }

  .gear-table-cell {
    border-radius: 0;
    height: 12px;
  }
}

/* ── Algorithmic Timeline ──────────────────────── */
.cdc-timeline {
  --line-x: 20px;
  --line-color: rgba(61, 58, 54, 0.25);
  --dot-size: 14px;
  position: relative;
  padding: 1rem 0 1rem 0;
  margin-left: 0.5rem;
}

/* Continuous vertical line */
.cdc-timeline::before {
  content: '';
  position: absolute;
  left: var(--line-x);
  top: 0;
  bottom: 0;
  width: 2px;
  background: var(--line-color);
}

/* ── Year markers ─────────────────────────────── */
.cdc-tl-marker {
  position: relative;
  padding: 0.6rem 0 0.6rem calc(var(--line-x) + 20px);
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.85rem;
  font-weight: 600;
  color: #a89b8c;
  letter-spacing: 0.03em;
}
/* Tick mark on the line */
.cdc-tl-marker::before {
  content: '';
  position: absolute;
  left: calc(var(--line-x) - 4px);
  top: 50%;
  width: 10px;
  height: 2px;
  background: var(--line-color);
  transform: translateY(-50%);
}

/* ── Entry (dot + card) ───────────────────────── */
.cdc-tl-entry {
  position: relative;
  display: flex;
  align-items: flex-start;
  padding: 0.5rem 0;
  padding-left: calc(var(--line-x) + 20px);
}

/* Dot */
.cdc-tl-dot {
  position: absolute;
  left: var(--line-x);
  top: 1rem;
  width: var(--dot-size);
  height: var(--dot-size);
  border-radius: 50%;
  transform: translateX(-50%);
  z-index: 2;
  border: 2px solid #fff;
  box-shadow: 0 0 0 2px var(--line-color);
  flex-shrink: 0;
}
.cdc-tl-dot.bsw { background: #c45a3b; }
.cdc-tl-dot.extrema { background: #2a7d4f; }
.cdc-tl-dot.statistical { background: #8b7355; }

/* Card */
.cdc-tl-card {
  background: rgba(61, 58, 54, 0.03);
  border: 1px solid rgba(61, 58, 54, 0.08);
  border-radius: 8px;
  padding: 0.75rem 1rem;
  width: 100%;
  min-width: 0;
}

/* Tab navigation */
.cdc-tl-tabs {
  margin-top: 0.5rem;
}
.cdc-tl-tabs input[type="radio"] {
  display: none;
}
.cdc-tl-tabs label {
  display: inline-block;
  padding: 0.35rem 0.65rem;
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.75rem;
  font-weight: 600;
  color: #a89b8c;
  cursor: pointer;
  border-bottom: 2px solid transparent;
  margin-right: 0.15rem;
  transition: color 0.2s ease, border-color 0.2s ease;
  letter-spacing: 0.02em;
}
.cdc-tl-tabs label:hover {
  color: #5a5550;
}
.cdc-tl-tab-content {
  margin-top: 0.4rem;
  border-top: 1px solid rgba(61, 58, 54, 0.08);
  padding-top: 0.5rem;
  height: 14rem;
  overflow-y: auto;
}
.cdc-tl-tab-panel {
  display: none;
}
.cdc-tl-tabs input:nth-of-type(1):checked ~ .cdc-tl-tab-content .cdc-tl-tab-panel:nth-child(1),
.cdc-tl-tabs input:nth-of-type(2):checked ~ .cdc-tl-tab-content .cdc-tl-tab-panel:nth-child(2),
.cdc-tl-tabs input:nth-of-type(3):checked ~ .cdc-tl-tab-content .cdc-tl-tab-panel:nth-child(3),
.cdc-tl-tabs input:nth-of-type(4):checked ~ .cdc-tl-tab-content .cdc-tl-tab-panel:nth-child(4) {
  display: block;
}
.cdc-tl-tabs input:nth-of-type(1):checked ~ label:nth-of-type(1),
.cdc-tl-tabs input:nth-of-type(2):checked ~ label:nth-of-type(2),
.cdc-tl-tabs input:nth-of-type(3):checked ~ label:nth-of-type(3),
.cdc-tl-tabs input:nth-of-type(4):checked ~ label:nth-of-type(4) {
  color: #c45a3b;
  border-bottom-color: #c45a3b;
}

/* Card header: name left, family badge right */
.cdc-tl-header {
  display: flex;
  justify-content: space-between;
  align-items: flex-start;
  gap: 0.5rem;
}

.cdc-tl-card .cdc-tax-family-label {
  font-size: 0.75rem;
  padding: 0.2rem 0.5rem;
  display: inline-block;
  flex-shrink: 0;
  margin-top: 0.15rem;
}

.cdc-tl-name {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 1.15em;
  font-weight: 700;
  color: #3d3a36;
  line-height: 1.3;
}

.cdc-tl-year {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.9rem;
  font-weight: 700;
  color: #a89b8c;
  margin-bottom: 0.4rem;
}

.cdc-tl-desc {
  font-size: inherit;
  color: #5a5550;
  line-height: 1.7em;
}

.cdc-tl-card .highlight {
  margin: 0.3rem 0 0 0;
  border-radius: 5px;
  border: 1px solid rgba(61, 58, 54, 0.08);
  overflow-x: auto;
}
.cdc-tl-card .highlight pre {
  line-height: 1.5;
  margin: 0;
}
</style>

<!-- MathJax for rendering mathematical notation -->
<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$']],
    displayMath: [['$$', '$$'], ['\\[', '\\]']]
  }
};
</script>

<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" async=""></script>

<div class="cdc-series-nav">
Part 1 of 5 in a series on Content-Defined Chunking. Next: <a href="/writings/content-defined-chunking-part-2">Part 2: A Deep Dive into FastCDC</a>
</div>

<p>Have you ever considered what it takes to store user content at scale, especially content like files or source code, where many versions of the same document exist with only minor changes between them? It’s easy to take document storage for granted, but storing content efficiently at scale is a surprisingly nuanced problem. While there are numerous aspects to this problem space, this series focuses on one in particular: deduplication. How do you avoid storing the same unchanged bytes of a file over and over again as that file evolves? Ideally, a storage system would store only the minimum set of unique bytes. Is that even possible? This series will help answer that question, and more.</p>

<p>What is deduplication? At its heart, it is the separation of content that has changed from content that has not, with varying levels of precision and accuracy, so that unchanged content only needs to be stored once. But how do you identify what changed? That’s where the content-defined chunking (CDC) family of algorithms comes in. These algorithms share a common goal: splitting a file into smaller chunks at byte boundaries determined by the content itself rather than by fixed offsets.</p>

<p>Ideally, a small edit to a file should disturb as few of the surrounding chunks as possible: new chunks capture the edit, while chunks whose content has not changed are preserved. Different kinds of content have different structure, and not all content can be chunked the same way. As we will see, each CDC algorithm’s approach comes with its own benefits and tradeoffs.</p>

<p>This series begins with an overview of CDC algorithms, grounded in the 40 years of research behind them. Next, you’ll take a deep dive into FastCDC, a popular general-purpose CDC algorithm, to understand how it works and what makes it so fast. To put CDC in the context of production systems, you’ll then walk through a general deduplication pipeline. This sets the stage for understanding CDC’s costs and tradeoffs, the factors that dominate deduplication system costs at scale, and how to navigate cloud storage decisions.</p>

<p>Along the way, several interactive visualizations help build mental models of the algorithms, their constraints, and the costs associated with deduplication systems. You can step through how FastCDC identifies chunk boundaries, see the impact of normalized chunking, and watch how chunk size affects storage pressure as you edit a file. As the discussion shifts toward costs, the visualizations let you adjust container capacities and cache hit rates against the pricing schemes of popular cloud object storage and cache providers, making it easy to see how strongly levers like chunk size and container size affect the operating costs of deduplication systems at scale.</p>

<div class="cdc-toc">
  <strong>Contents</strong>
  <ol>
    <li>
      <a href="#motivating-the-problem">Motivating the Problem</a>
      <ul>
        <li><a href="#why-not-just-use-fixed-size-chunks">Why Not Just Use Fixed-Size Chunks?</a></li>
        <li><a href="#the-core-idea-content-as-the-arbiter">The Core Idea: Content as the Arbiter</a></li>
      </ul>
    </li>
    <li>
      <a href="#three-families-of-cdc">Three Families of CDC</a>
      <ul>
        <li><a href="#origins">Origins</a></li>
        <li><a href="#a-taxonomy-of-cdc-algorithms">A Taxonomy of CDC Algorithms</a></li>
        <li><a href="#algorithmic-timeline">Algorithmic Timeline</a></li>
      </ul>
    </li>
    <li>
      <a href="/writings/content-defined-chunking-part-2#a-closer-look-at-bsw-via-fastcdc">A Closer Look at BSW via FastCDC</a> <em style="font-size: 0.78rem; color: #a89b8c;">(Part 2)</em>
      <ul>
        <li><a href="/writings/content-defined-chunking-part-2#the-gear-hash">The Gear Hash</a></li>
        <li><a href="/writings/content-defined-chunking-part-2#finding-chunk-boundaries">Finding Chunk Boundaries</a></li>
        <li><a href="/writings/content-defined-chunking-part-2#the-2016-algorithm">The 2016 Algorithm</a></li>
        <li><a href="/writings/content-defined-chunking-part-2#the-2020-enhancement-rolling-two-bytes">The 2020 Enhancement: Rolling Two Bytes</a></li>
        <li><a href="/writings/content-defined-chunking-part-2#exploring-the-parameters">Exploring the Parameters</a></li>
      </ul>
    </li>
    <li>
      <a href="/writings/content-defined-chunking-part-3#deduplication-in-action">Deduplication in Action</a> <em style="font-size: 0.78rem; color: #a89b8c;">(Part 3)</em>
      <ul>
        <li><a href="/writings/content-defined-chunking-part-3#the-deduplication-pipeline">The Deduplication Pipeline</a></li>
        <li><a href="/writings/content-defined-chunking-part-3#the-core-cost-tradeoffs">The Core Cost Tradeoffs</a></li>
        <li><a href="/writings/content-defined-chunking-part-3#where-cdc-lives-today">Where CDC Lives Today</a></li>
        <li><a href="/writings/content-defined-chunking-part-3#when-cdc-is-not-the-right-choice">When CDC Is Not the Right Choice</a></li>
        <li><a href="/writings/content-defined-chunking-part-3#why-cloud-storage-is-the-cost-that-matters">Why Cloud Storage is the Cost that Matters</a></li>
      </ul>
    </li>
    <li>
      <a href="/writings/content-defined-chunking-part-4">CDC in the Cloud</a> <em style="font-size: 0.78rem; color: #a89b8c;">(Part 4)</em>
      <ul>
        <li><a href="/writings/content-defined-chunking-part-4#the-cloud-cost-problem">The Cloud Cost Problem</a></li>
        <li><a href="/writings/content-defined-chunking-part-4#reducing-costs-through-containers">Reducing Costs through Containers</a></li>
        <li><a href="/writings/content-defined-chunking-part-4#the-impact-of-containers-on-cost">The Impact of Containers on Cost</a></li>
        <li><a href="/writings/content-defined-chunking-part-4#more-containers-more-problems">More Containers More Problems</a></li>
        <li><a href="/writings/content-defined-chunking-part-4#the-fragmentation-problem">The Fragmentation Problem</a></li>
        <li><a href="/writings/content-defined-chunking-part-4#garbage-collection">Garbage Collection</a></li>
        <li><a href="/writings/content-defined-chunking-part-4#container-size-as-the-primary-lever">Container Size as the Primary Lever</a></li>
      </ul>
    </li>
    <li>
      <a href="/writings/content-defined-chunking-part-5">CDC at Scale on a Budget</a> <em style="font-size: 0.78rem; color: #a89b8c;">(Part 5)</em>
      <ul>
        <li><a href="/writings/content-defined-chunking-part-5#the-cost-comparison-continued">The Cost Comparison Continued</a></li>
        <li><a href="/writings/content-defined-chunking-part-5#reducing-costs-through-caching">Reducing Costs through Caching</a></li>
        <li><a href="/writings/content-defined-chunking-part-5#all-costs-considered">All Costs Considered</a></li>
        <li><a href="/writings/content-defined-chunking-part-5#why-i-care-about-this">Why I Care About This</a></li>
        <li><a href="/writings/content-defined-chunking-part-5#conclusion">Conclusion</a></li>
      </ul>
    </li>
    <li><a href="/references/cdc">References</a></li>
    <li>
      <a href="/sandbox/cdc">Visualizations</a>
      <ul>
        <li><a href="/sandbox/cdc#fixed-chunking-demo">Fixed-Size Chunking</a></li>
        <li><a href="/sandbox/cdc#cdc-chunking-demo">Content-Defined Chunking</a></li>
        <li><a href="/sandbox/cdc#gear-hash-demo">Gear Hash in Action</a></li>
        <li><a href="/sandbox/cdc#parametric-demo">Parametric Chunking Explorer</a></li>
        <li><a href="/sandbox/cdc#comparison-demo">Basic vs Normalized Chunk Size Distribution</a></li>
        <li><a href="/sandbox/cdc#dedup-demo">Deduplication Explorer</a></li>
        <li><a href="/sandbox/cdc#cost-tradeoffs-demo">Cost Tradeoffs Explorer</a></li>
        <li><a href="/sandbox/cdc#naive-cost-demo">Established Object Storage Provider Cost Explorer</a></li>
        <li><a href="/sandbox/cdc#packed-cost-demo">Established Object Storage Provider Cost Explorer with Containers</a></li>
        <li><a href="/sandbox/cdc#newcomer-cost-demo">Challenger Object Storage Provider Cost Explorer</a></li>
        <li><a href="/sandbox/cdc#provider-comparison-demo">Established vs. Challenger Object Storage Provider Cost Comparison</a></li>
        <li><a href="/sandbox/cdc#zipf-distribution-demo">Zipf Popularity Distribution</a></li>
        <li><a href="/sandbox/cdc#zipf-cache-demo">Cache Size vs. Hit Rate</a></li>
        <li><a href="/sandbox/cdc#cache-traditional-demo">Established Cache Provider Cost Explorer</a></li>
        <li><a href="/sandbox/cdc#cache-newcomer-demo">Challenger Cache Provider Cost Explorer</a></li>
        <li><a href="/sandbox/cdc#comprehensive-cost-demo">Comprehensive Cost Model</a></li>
      </ul>
    </li>
  </ol>
</div>

<hr />

<h2 id="motivating-the-problem">Motivating the Problem</h2>

<p>Imagine you’re building a backup system. A user stores a 500MB file, then modifies a single paragraph and saves it again. A naive system now holds two nearly identical copies of the same file: despite a change to a single paragraph, storage use has doubled from 500MB to 1GB. Surely we can do better.</p>

<p>This is the <strong>deduplication problem</strong>, and it shows up in many familiar places: cloud blob storage providers managing petabytes of user files (e.g. <a href="https://aws.amazon.com/s3/">Amazon S3</a> or <a href="https://azure.microsoft.com/en-us/products/storage/blobs">Azure Blob Storage</a>), cloud file servers like Google Drive or iCloud, and software backup tools like <a href="https://restic.net/">Restic</a> and <a href="https://www.borgbackup.org/">Borg</a>.</p>

<p>The simplest form of deduplication is whole-file comparison: hash the entire file, and if two files produce the same hash, store only one copy. This works well for exact duplicates, but falls apart with even the smallest edit. Change a single byte and the hash changes completely, so the system treats the original and the edited version as two entirely different files.</p>

<p>One fix is to reduce the granularity of comparison. Instead of hashing a file as a single unit, split it into smaller segments called chunks and hash each chunk independently. A small edit now only affects the chunks near the change, leaving the rest unchanged. Those unchanged chunks can be recognized as duplicates and stored only once. The question then becomes: how should we decide where to split?</p>

<h3 id="why-not-just-use-fixed-size-chunks">Why Not Just Use Fixed-Size Chunks?</h3>

<p>The naive approach to chunking is fixed-size splitting: choose a chunk size, say 4KB, and split the file at every 4KB boundary. A 1MB file becomes 256 chunks of 4KB each. This approach is conceptually simple, but is problematic if we want to prevent <strong>change amplification</strong> – the invalidation of unchanged chunks when small edits occur. Using this naive chunking strategy, let’s see what happens to unchanged chunks when a small edit occurs at the beginning of a file:</p>

<div class="cdc-comparison-panel" id="fixed-chunking-demo" style="margin: 2rem 0;">
  <div class="cdc-comparison-title">Fixed-Size Chunking (48 bytes)</div>
  <!-- Populated dynamically by ChunkComparisonDemo -->
</div>

<p>Inserting “NEW INTRO.” at the beginning of the file causes every chunk boundary to shift, invalidating all five original chunks. The result is five new chunks and zero unchanged chunks, producing a deduplication ratio of 0%. In practice, this means the entire file would need to be stored again, even though most of its content did not change. We need a chunking strategy whose boundaries are not determined by fixed byte offsets, and that offers more flexibility to identify split points that better preserve unchanged chunks.</p>
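<p>The boundary shift is easy to reproduce in a few lines of C. The following is a minimal sketch, not the code behind the demo above: the helper name <code>count_shared_fixed_chunks</code> and the brute-force byte comparison are assumptions made for illustration. It cuts two strings at fixed byte offsets and counts how many chunks of the edited version also occur as chunks of the original.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the chunks of `edited` that also occur as a chunk of `original`
 * when both strings are cut at fixed `chunk_size` byte offsets.
 * Brute-force comparison; a real system would compare chunk hashes. */
static size_t count_shared_fixed_chunks(const char *original,
                                        const char *edited,
                                        size_t chunk_size) {
    assert(chunk_size > 0);
    size_t olen = strlen(original), elen = strlen(edited), shared = 0;

    for (size_t e = 0; e < elen; e += chunk_size) {
        /* The last chunk may be shorter than chunk_size. */
        size_t en = (elen - e < chunk_size) ? elen - e : chunk_size;
        for (size_t o = 0; o < olen; o += chunk_size) {
            size_t on = (olen - o < chunk_size) ? olen - o : chunk_size;
            if (en == on && memcmp(edited + e, original + o, en) == 0) {
                shared++;   /* this chunk is already stored */
                break;
            }
        }
    }
    return shared;
}
```

<p>With an 8-byte chunk size, prepending an 11-byte string such as “NEW INTRO. ” shifts the content by an amount that is not a multiple of the chunk size, so for ordinary prose none of the edited file’s chunks line up with the original’s.</p>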

<h3 id="the-core-idea-content-as-the-arbiter">The Core Idea: Content as the Arbiter</h3>

<p>Instead of using fixed-length byte windows to split a file into chunks, what if we could use patterns or structure in the file’s content to identify chunk boundaries? This is the core problem a family of algorithms known as content-defined chunking (CDC) attempts to solve.</p>

<p>How does CDC decide where to split? The details vary across algorithms, but the core principle is the same: examine a small region of data at each position, and declare a boundary when the content at that position satisfies some condition. Different algorithms use different strategies for this. Some compute a hash of a sliding window, some look for local extrema in the byte values, and some use statistical properties of the data. What they all share is that the boundary decision, or split point, is dependent on the content itself.</p>

<p>Let’s revisit the same example from before, but this time we split the text at sentence boundaries. Each sentence ending (a period, exclamation mark, or question mark followed by a space) defines a chunk boundary. Because the boundary is determined by the content itself, not by a fixed byte count, inserting text at the beginning of the file does not invalidate existing unchanged chunks.</p>

<div class="cdc-comparison-panel" id="cdc-chunking-demo" style="margin: 2rem 0;">
  <div class="cdc-comparison-title">Content-Defined Chunking (sentence boundaries)</div>
  <!-- Populated dynamically by ChunkComparisonDemo -->
</div>

<p>Inserting “NEW INTRO.” creates just one new chunk. The original five sentences are unchanged, so their chunks are identical to before. The result is a much higher deduplication ratio, meaning we only need to store the new chunk and can reference the existing chunks for the rest of the file.</p>

<div class="cdc-callout" data-label="Key Insight">
When chunk boundaries are defined by the content itself rather than by fixed byte offsets, a small edit only affects the chunks near the change. The rest of the file's chunks remain identical and can be deduplicated.
</div>
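
<p>The sentence-boundary rule can be sketched in the same way. This is an illustrative toy under stated assumptions, not a production chunker: the function names are hypothetical, the trailing space is kept with the preceding sentence, and chunks are compared byte for byte rather than by hash.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the next chunk starting at text[0]. A chunk ends after a
 * '.', '!' or '?' that is followed by a space (the space stays with the
 * chunk), or at the end of the input. */
static size_t next_sentence_chunk(const char *text, size_t len) {
    assert(len > 0);
    for (size_t i = 0; i + 1 < len; i++) {
        char c = text[i];
        if ((c == '.' || c == '!' || c == '?') && text[i + 1] == ' ')
            return i + 2;
    }
    return len;
}

/* Count the chunks of `edited` that also occur as a chunk of `original`
 * when both are split at sentence boundaries. */
static size_t count_shared_sentence_chunks(const char *original,
                                           const char *edited) {
    size_t olen = strlen(original), elen = strlen(edited), shared = 0;

    for (size_t e = 0; e < elen; ) {
        size_t en = next_sentence_chunk(edited + e, elen - e);
        for (size_t o = 0; o < olen; ) {
            size_t on = next_sentence_chunk(original + o, olen - o);
            if (en == on && memcmp(edited + e, original + o, en) == 0) {
                shared++;   /* sentence already stored */
                break;
            }
            o += on;
        }
        e += en;
    }
    return shared;
}
```

<p>Because each boundary depends only on the bytes around it, prepending a new sentence adds one chunk and leaves every original sentence chunk byte-identical.</p>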

<hr />

<h2 id="three-families-of-cdc">Three Families of CDC</h2>

<h3 id="origins">Origins</h3>

<p>The story begins with Turing Award winner <strong>Michael Rabin</strong>, who introduced polynomial fingerprinting in 1981.<span class="cdc-cite"><a href="#ref-1">[1]</a></span> His key insight: represent a sequence of bytes as a polynomial and reduce it modulo a randomly chosen irreducible polynomial to get a “fingerprint” that identifies the content with high probability. More importantly, this fingerprint could be computed <em>incrementally</em> as a <strong>rolling hash</strong>, making it efficient to slide across data.</p>

<p>For a sequence of bytes $b_0, b_1, \ldots, b_{n-1}$, the fingerprint is:</p>

\[f(x) = \left(b_0 + b_1 \cdot x + b_2 \cdot x^2 + \cdots + b_{n-1} \cdot x^{n-1}\right) \bmod p\]

<p>where $p$ is an irreducible polynomial over $GF(2)$.</p>

<div class="cdc-learn-more">
Ask your AI assistant about "Galois fields" and "polynomial arithmetic in GF(2)" to understand the mathematical foundations.
</div>

<p>Twenty years later, the <strong>Low-Bandwidth File System</strong> (LBFS) at MIT became the first major system to use CDC in practice.<span class="cdc-cite"><a href="#ref-2">[2]</a></span> LBFS slid a 48-byte window across the data and computed a Rabin fingerprint at each position. Whenever the low 13 bits of the fingerprint equaled a magic constant, it declared a chunk boundary, producing an average chunk size of about 8KB. The breakthrough was demonstrating that CDC could achieve dramatic bandwidth savings for real file workloads. After a single paragraph in a large document was modified, only the changed chunks had to be transmitted, not the entire file.</p>

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre><span class="c1">// Simplified LBFS boundary check</span>
<span class="k">if</span> <span class="p">((</span><span class="n">fingerprint</span> <span class="o">%</span> <span class="mi">8192</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x78</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// This is a chunk boundary</span>
    <span class="n">emit_chunk</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">current_position</span><span class="p">);</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">current_position</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>
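
<p>The 8KB average follows from a short calculation, sketched here under the assumption that the low 13 bits of the fingerprint behave like independent, uniformly random values at each position, and ignoring any minimum or maximum chunk size limits. Each position is then a boundary with probability $p = 2^{-13}$, so the chunk length $L$ is geometrically distributed:</p>

\[P(L = n) = (1 - p)^{n-1}\, p, \qquad \mathbb{E}[L] = \frac{1}{p} = 2^{13} = 8192 \text{ bytes}\]

<p>Under the same assumption, widening or narrowing the mask by one bit doubles or halves the expected chunk size.</p>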

<p>The deduplication era of 2005-2015 drove an explosion of CDC research. Many successful systems built their deduplication pipelines on CDC, drawing on research that produced faster hash functions, better chunk size distributions, and new ways of finding chunk boundaries. By the mid-2010s, what had been a single technique had branched into a family of algorithms with fundamentally different strategies.</p>

<h3 id="a-taxonomy-of-cdc-algorithms">A Taxonomy of CDC Algorithms</h3>

<p>A comprehensive 2024 survey by Gregoriadis et al.<span class="cdc-cite"><a href="#ref-12">[12]</a></span> organizes the CDC landscape into <strong>three distinct families</strong> based on their core mechanism for finding chunk boundaries. This taxonomy clarifies a field that can otherwise feel like a confusing proliferation of acronyms.</p>

<div class="cdc-taxonomy-table">
  <table>
    <thead>
      <tr>
        <th></th>
        <th class="bsw">BSW (Basic Sliding Window)</th>
        <th class="extrema">Local Extrema</th>
        <th class="statistical">Statistical</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td class="row-label">Algorithms</td>
        <td>Rabin <span class="algo-year">1981</span>, Buzhash <span class="algo-year">1997</span>, Gear <span class="algo-year">2014</span>, FastCDC <span class="algo-year">2016</span>, PCI <span class="algo-year">2020</span></td>
        <td>AE <span class="algo-year">2015</span>, RAM <span class="algo-year">2017</span>, MII <span class="algo-year">2019</span>, VectorCDC <span class="algo-year">2025</span></td>
        <td>BFBC <span class="algo-year">2020</span></td>
      </tr>
      <tr>
        <td class="row-label">Core operation</td>
        <td>Rolling hash + mask</td>
        <td>Byte comparisons</td>
        <td>Frequency analysis</td>
      </tr>
      <tr>
        <td class="row-label">Throughput</td>
        <td>Medium&ndash;High</td>
        <td>High</td>
        <td>Medium</td>
      </tr>
      <tr>
        <td class="row-label">Dedup ratio</td>
        <td>High</td>
        <td>Comparable</td>
        <td>Dataset-dependent</td>
      </tr>
      <tr>
        <td class="row-label">SIMD-friendly</td>
        <td>Limited</td>
        <td>Excellent</td>
        <td>Limited</td>
      </tr>
      <tr>
        <td class="row-label">Streaming</td>
        <td>Yes</td>
        <td>Yes</td>
        <td>No (pre-scan)</td>
      </tr>
      <tr>
        <td class="row-label">Chunk distribution</td>
        <td>Exponential (improved with NC)</td>
        <td>Varies</td>
        <td>Varies</td>
      </tr>
      <tr>
        <td class="row-label">Used in practice</td>
        <td>Restic, Borg, FastCDC</td>
        <td>Research</td>
        <td>Research</td>
      </tr>
    </tbody>
  </table>
  <div class="cdc-taxonomy-table-note">Taxonomy from Gregoriadis et al. <a href="#ref-12">[12]</a></div>
</div>

<h3 id="algorithmic-timeline">Algorithmic Timeline</h3>

<div class="cdc-timeline">

  <div class="cdc-tl-marker">1980</div>

  <div class="cdc-tl-entry">
    <div class="cdc-tl-dot bsw"></div>
    <div class="cdc-tl-card">
      <div class="cdc-tl-header">
        <div class="cdc-tl-name">Rabin Fingerprint<span class="cdc-cite"><a href="#ref-1">[1]</a></span></div>
        <span class="cdc-tax-family-label bsw">BSW</span>
      </div>
      <div class="cdc-tl-year">1981</div>
      <div class="cdc-tl-tabs">
        <input type="radio" name="tab-rabin" id="tab-rabin-1" checked="" />
        <label for="tab-rabin-1">Overview</label>
        <input type="radio" name="tab-rabin" id="tab-rabin-2" />
        <label for="tab-rabin-2">Application</label>
        <input type="radio" name="tab-rabin" id="tab-rabin-3" />
        <label for="tab-rabin-3">Code</label>
        <input type="radio" name="tab-rabin" id="tab-rabin-4" />
        <label for="tab-rabin-4">Analysis</label>
        <div class="cdc-tl-tab-content">
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">The foundational rolling hash for CDC. Rabin's fingerprint operates over <em>GF(2)</em> (the Galois field with two elements) where all arithmetic reduces to XOR and carry-less multiplication. The key insight: the hash of a sliding window can be updated in <em>O(1)</em> by removing the outgoing byte's contribution and adding the incoming byte's, without recomputing from scratch. This was the first practical rolling hash with provable uniformity: the probability of two distinct <em>k</em>-bit strings colliding is at most about <em>k/2<sup>d</sup></em> for a randomly chosen irreducible polynomial of degree <em>d</em>. The polynomial arithmetic makes it slower than later alternatives, but its mathematical foundation remains unmatched.</div>
          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Used in <strong>Restic</strong> backup and <strong>LBFS</strong>.<span class="cdc-cite"><a href="#ref-2">[2]</a></span> Restic generates a random irreducible polynomial of degree 53 per repository, so that chunk boundaries cannot be predicted from known content. Rabin's provable collision bound -- at most <em>k/2<sup>d</sup></em> for an irreducible polynomial of degree <em>d</em> -- makes it the choice when formal hash uniformity guarantees are needed.</div>
          </div>
          <div class="cdc-tl-tab-panel">

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="c1">// Rabin fingerprint: rolling hash over GF(2)</span>
<span class="kt">uint64_t</span> <span class="n">fp</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">fp</span> <span class="o">^=</span> <span class="n">shift_table</span><span class="p">[</span><span class="n">window</span><span class="p">[</span><span class="n">i</span> <span class="o">%</span> <span class="n">w</span><span class="p">]];</span>  <span class="c1">// remove outgoing byte</span>
    <span class="kt">uint64_t</span> <span class="n">top</span> <span class="o">=</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="n">DEG</span> <span class="o">-</span> <span class="mi">8</span><span class="p">);</span>          <span class="c1">// top byte, about to overflow degree DEG</span>
    <span class="n">fp</span> <span class="o">=</span> <span class="p">((</span><span class="n">fp</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">|</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">^</span> <span class="n">mod_table</span><span class="p">[</span><span class="n">top</span><span class="p">];</span>  <span class="c1">// shift in new byte, reduce mod poly via table</span>
    <span class="n">window</span><span class="p">[</span><span class="n">i</span> <span class="o">%</span> <span class="n">w</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="k">if</span> <span class="p">((</span><span class="n">fp</span> <span class="o">%</span> <span class="n">D</span><span class="p">)</span> <span class="o">==</span> <span class="n">r</span><span class="p">)</span> <span class="k">return</span> <span class="n">i</span><span class="p">;</span>       <span class="c1">// boundary!</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>
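
            <div class="cdc-tl-desc">The fragment above assumes precomputed tables. As a self-contained illustration, here is a sketch that rolls a fingerprint over a fixed window and can be checked against a direct computation. Everything specific in it is an assumption made for illustration: the 16-byte window, the names <code>rabin_init</code> and <code>rabin_roll</code>, the bit-by-bit reduction in place of a lookup table, and the modulus <em>x<sup>31</sup> + x<sup>3</sup> + 1</em>, a fixed irreducible trinomial chosen for simplicity where a real system would generate a random polynomial.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RABIN_DEG  31             /* degree of the modulus polynomial     */
#define RABIN_POLY 0x80000009ULL  /* x^31 + x^3 + 1 (illustrative choice) */
#define RABIN_WIN  16             /* window size in bytes (hypothetical)  */

/* Reduce v modulo RABIN_POLY over GF(2): cancel each set bit at or above
 * the modulus degree by XORing in a shifted copy of the polynomial. */
static uint64_t poly_mod(uint64_t v) {
    for (int bit = 63; bit >= RABIN_DEG; bit--)
        if (v & (1ULL << bit))
            v ^= RABIN_POLY << (bit - RABIN_DEG);
    return v;
}

/* Direct (non-rolling) fingerprint of n bytes, used as a reference. */
static uint64_t fingerprint(const uint8_t *p, size_t n) {
    uint64_t fp = 0;
    for (size_t i = 0; i < n; i++)
        fp = poly_mod((fp << 8) | p[i]);
    return fp;
}

typedef struct {
    uint64_t out_table[256];      /* byte * x^(8*(WIN-1)) mod poly     */
    uint8_t  window[RABIN_WIN];   /* circular buffer of recent bytes   */
    size_t   pos;                 /* index of the oldest byte          */
    uint64_t fp;                  /* fingerprint of the current window */
} rabin_t;

static void rabin_init(rabin_t *r) {
    assert(r != NULL);
    memset(r, 0, sizeof *r);
    for (int b = 0; b < 256; b++) {
        uint64_t t = (uint64_t)b;
        for (int k = 0; k < RABIN_WIN - 1; k++)
            t = poly_mod(t << 8);   /* multiply by x^8, reduce */
        r->out_table[b] = t;
    }
}

/* Slide the window forward by one byte and return the new fingerprint. */
static uint64_t rabin_roll(rabin_t *r, uint8_t in) {
    uint8_t out = r->window[r->pos];
    r->window[r->pos] = in;
    r->pos = (r->pos + 1) % RABIN_WIN;
    r->fp ^= r->out_table[out];             /* remove the outgoing byte     */
    r->fp = poly_mod((r->fp << 8) | in);    /* shift in the new byte, reduce */
    return r->fp;
}
```

            <div class="cdc-tl-desc">Once the window is full, each call to <code>rabin_roll</code> should return the same value as <code>fingerprint</code> applied to the last 16 bytes, which is the property that makes the hash "rolling".</div>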

          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc"><div><strong>Time:</strong> <em>O(1)</em> per byte (one XOR to remove, one shift + XOR to add, one polynomial reduction).</div><div><strong>Space:</strong> <em>O(w + 256)</em>, sliding window buffer plus a precomputed byte-shift table.</div></div>
          </div>
        </div>
      </div>
    </div>
  </div>

  <div class="cdc-tl-marker">1985</div>
  <div class="cdc-tl-marker">1990</div>
  <div class="cdc-tl-marker">1995</div>

  <div class="cdc-tl-entry">
    <div class="cdc-tl-dot bsw"></div>
    <div class="cdc-tl-card">
      <div class="cdc-tl-header">
        <div class="cdc-tl-name">Buzhash<span class="cdc-cite"><a href="#ref-3">[3]</a></span></div>
        <span class="cdc-tax-family-label bsw">BSW</span>
      </div>
      <div class="cdc-tl-year">1997</div>
      <div class="cdc-tl-tabs">
        <input type="radio" name="tab-buzhash" id="tab-buzhash-1" checked="" />
        <label for="tab-buzhash-1">Overview</label>
        <input type="radio" name="tab-buzhash" id="tab-buzhash-2" />
        <label for="tab-buzhash-2">Application</label>
        <input type="radio" name="tab-buzhash" id="tab-buzhash-3" />
        <label for="tab-buzhash-3">Code</label>
        <input type="radio" name="tab-buzhash" id="tab-buzhash-4" />
        <label for="tab-buzhash-4">Analysis</label>
        <div class="cdc-tl-tab-content">
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Replaces Rabin's polynomial division with a <strong>cyclic polynomial</strong> where each byte maps to a random value via a lookup table, and the hash is maintained by cyclically rotating (barrel shifting) the current value and XORing in the new byte's table entry. Removing the outgoing byte uses the same table but rotated by the window size. This eliminates the polynomial reduction step entirely: no multiplication, just rotations and XORs. The result is significantly faster than Rabin in practice while providing comparable distribution properties for boundary detection.</div>
          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Used in <strong>Borg</strong> backup, which XORs the lookup table with an encrypted per-repository seed to prevent chunk-boundary fingerprinting attacks. Borg notes that Buzhash is used only for boundary detection; a separate cryptographic MAC serves as the deduplication hash. The combination of simple arithmetic (rotations and XORs) with seed-based table randomization makes Buzhash a practical choice when both throughput and boundary-prediction resistance are needed.</div>
          </div>
          <div class="cdc-tl-tab-panel">

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre><span class="c1">// Buzhash: cyclic polynomial rolling hash</span>
<span class="kt">uint32_t</span> <span class="n">table</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>                            <span class="c1">// random values, initialized once</span>

<span class="kt">uint32_t</span> <span class="n">h</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">ROTATE_LEFT</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>                      <span class="c1">// cyclic shift by 1</span>
    <span class="n">h</span> <span class="o">^=</span> <span class="n">ROTATE_LEFT</span><span class="p">(</span><span class="n">table</span><span class="p">[</span><span class="n">window</span><span class="p">[</span><span class="n">i</span> <span class="o">%</span> <span class="n">w</span><span class="p">]],</span> <span class="n">w</span><span class="p">);</span>  <span class="c1">// remove outgoing</span>
    <span class="n">h</span> <span class="o">^=</span> <span class="n">table</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span>                        <span class="c1">// add incoming</span>
    <span class="n">window</span><span class="p">[</span><span class="n">i</span> <span class="o">%</span> <span class="n">w</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="k">if</span> <span class="p">((</span><span class="n">h</span> <span class="o">%</span> <span class="n">D</span><span class="p">)</span> <span class="o">==</span> <span class="n">r</span><span class="p">)</span> <span class="k">return</span> <span class="n">i</span><span class="p">;</span>                 <span class="c1">// boundary!</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>
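
            <div class="cdc-tl-desc">For experimentation, the loop above can be wrapped into a small self-contained sketch. The details here are assumptions for illustration only: the 48-byte window, the xorshift generator used to fill the table, and the names <code>buzhash_init</code>, <code>buzhash_roll</code>, and <code>buzhash_direct</code>. Unlike the fragment above, this version skips the "remove outgoing" step until the window has filled.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUZ_WIN 48  /* window size in bytes (hypothetical) */

/* Rotate left by n bits; n is reduced mod 32 to avoid undefined shifts. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

typedef struct {
    uint32_t table[256];        /* pseudo-random value per byte   */
    uint8_t  window[BUZ_WIN];   /* circular buffer of recent bytes */
    size_t   pos;               /* index of the oldest byte        */
    size_t   filled;            /* bytes currently in the window   */
    uint32_t h;                 /* hash of the current window      */
} buzhash_t;

static void buzhash_init(buzhash_t *b, uint32_t seed) {
    assert(b != NULL);
    memset(b, 0, sizeof *b);
    uint32_t x = seed ? seed : 1;   /* xorshift32 must not start at 0 */
    for (int i = 0; i < 256; i++) {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        b->table[i] = x;
    }
}

/* Slide the window forward by one byte and return the new hash. */
static uint32_t buzhash_roll(buzhash_t *b, uint8_t in) {
    b->h = rotl32(b->h, 1);                                    /* cyclic shift by 1 */
    if (b->filled == BUZ_WIN)
        b->h ^= rotl32(b->table[b->window[b->pos]], BUZ_WIN);  /* remove outgoing   */
    else
        b->filled++;
    b->h ^= b->table[in];                                      /* add incoming      */
    b->window[b->pos] = in;
    b->pos = (b->pos + 1) % BUZ_WIN;
    return b->h;
}

/* Direct (non-rolling) hash of n bytes, used as a reference. */
static uint32_t buzhash_direct(const buzhash_t *b, const uint8_t *p, size_t n) {
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = rotl32(h, 1) ^ b->table[p[i]];
    return h;
}
```

            <div class="cdc-tl-desc">Comparing <code>buzhash_roll</code> against <code>buzhash_direct</code> over the trailing window is a quick way to check that the outgoing byte is being removed with the right rotation.</div>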

          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc"><div><strong>Time:</strong> <em>O(1)</em> per byte, consisting of two table lookups, two rotates, and two XORs.</div><div><strong>Space:</strong> <em>O(w + 256)</em>, window buffer plus the random lookup table.</div></div>
          </div>
        </div>
      </div>
    </div>
  </div>

  <div class="cdc-tl-marker">2000</div>
  <div class="cdc-tl-marker">2005</div>
  <div class="cdc-tl-marker">2010</div>

  <div class="cdc-tl-entry">
    <div class="cdc-tl-dot bsw"></div>
    <div class="cdc-tl-card">
      <div class="cdc-tl-header">
        <div class="cdc-tl-name">Gear<span class="cdc-cite"><a href="#ref-4">[4]</a></span></div>
        <span class="cdc-tax-family-label bsw">BSW</span>
      </div>
      <div class="cdc-tl-year">2014</div>
      <div class="cdc-tl-tabs">
        <input type="radio" name="tab-gear" id="tab-gear-1" checked="" />
        <label for="tab-gear-1">Overview</label>
        <input type="radio" name="tab-gear" id="tab-gear-2" />
        <label for="tab-gear-2">Application</label>
        <input type="radio" name="tab-gear" id="tab-gear-3" />
        <label for="tab-gear-3">Code</label>
        <input type="radio" name="tab-gear" id="tab-gear-4" />
        <label for="tab-gear-4">Analysis</label>
        <div class="cdc-tl-tab-content">
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Radically simplifies the rolling hash by eliminating the sliding window entirely. There is no outgoing byte to remove because the hash is purely feedforward. Each step left-shifts the hash by 1 bit and adds a random table lookup for the incoming byte: <code>hash = (hash &lt;&lt; 1) + table[byte]</code>. Since older bits shift out of the 64-bit register, each byte's contribution disappears after 64 steps, so the hash depends only on the most recent 64 bytes. The insight is that for CDC purposes, you don't need a true sliding window hash; an approximate one where old bytes decay away is sufficient, since boundary decisions are local. One shift + one add gives the tightest inner loop of any CDC hash, roughly 2-3&times; faster than Buzhash.</div>
          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Originally the chunking engine within the <strong>Ddelta</strong> delta compression system,<span class="cdc-cite"><a href="#ref-4">[4]</a></span> where it demonstrated 2&times; throughput over Rabin by cutting more than half the per-byte operations. Later adopted by <strong>FastCDC</strong> as its core hash.<span class="cdc-cite"><a href="#ref-5">[5]</a></span> The tight inner loop (one shift, one add) also makes Gear straightforward to implement and audit.</div>
          </div>
          <div class="cdc-tl-tab-panel">

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="c1">// Gear hash: feedforward, no window needed</span>
<span class="kt">uint64_t</span> <span class="n">gear_table</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span> <span class="c1">// random 64-bit values</span>

<span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">min_size</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">hash</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">gear_table</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span>
    <span class="k">if</span> <span class="p">((</span><span class="n">hash</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">i</span><span class="p">;</span> <span class="c1">// boundary!</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>
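
            <div class="cdc-tl-desc">To try the loop above on real buffers, it needs a populated table and chunk size limits. The sketch below adds both. Its specifics are assumptions for illustration rather than anything prescribed by the Gear or FastCDC papers: the splitmix64 generator used to fill the table, the <code>max_size</code> cap, and the names <code>gear_init</code> and <code>gear_next_chunk</code>. It also returns a chunk length (the boundary falls after byte <code>i</code>) rather than the index <code>i</code> itself.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t gear_table[256];

/* Fill the table with pseudo-random 64-bit values using splitmix64. */
static void gear_init(uint64_t seed) {
    for (int i = 0; i < 256; i++) {
        seed += 0x9E3779B97F4A7C15ULL;
        uint64_t z = seed;
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        gear_table[i] = z ^ (z >> 31);
    }
}

/* Length of the next chunk in data[0..len). The first min_size bytes are
 * skipped without hashing, and a cut is forced at max_size. */
static size_t gear_next_chunk(const uint8_t *data, size_t len,
                              size_t min_size, size_t max_size,
                              uint64_t mask) {
    assert(min_size <= max_size);
    if (len <= min_size) return len;        /* remainder is one short chunk */
    if (len > max_size) len = max_size;     /* never scan past max_size     */

    uint64_t hash = 0;
    for (size_t i = min_size; i < len; i++) {
        hash = (hash << 1) + gear_table[data[i]];
        if ((hash & mask) == 0)
            return i + 1;                   /* boundary after byte i */
    }
    return len;                             /* no boundary found: forced cut */
}
```

            <div class="cdc-tl-desc">Calling <code>gear_next_chunk</code> repeatedly, advancing by the returned length each time, splits a buffer into chunks. A mask with six bits set gives a boundary probability of roughly 1/64 per hashed byte, assuming the hash bits are close to uniform.</div>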

          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc"><div><strong>Time:</strong> <em>O(1)</em> per byte, consisting of one left-shift, one table lookup, and one addition.</div><div><strong>Space:</strong> <em>O(256)</em> for the lookup table. No window buffer needed.</div></div>
          </div>
        </div>
      </div>
    </div>
  </div>

  <div class="cdc-tl-marker">2015</div>

  <div class="cdc-tl-entry">
    <div class="cdc-tl-dot extrema"></div>
    <div class="cdc-tl-card">
      <div class="cdc-tl-header">
        <div class="cdc-tl-name">AE - Asymmetric Extremum<span class="cdc-cite"><a href="#ref-7">[7]</a></span></div>
        <span class="cdc-tax-family-label extrema">Extrema</span>
      </div>
      <div class="cdc-tl-year">2015</div>
      <div class="cdc-tl-tabs">
        <input type="radio" name="tab-ae" id="tab-ae-1" checked="" />
        <label for="tab-ae-1">Overview</label>
        <input type="radio" name="tab-ae" id="tab-ae-2" />
        <label for="tab-ae-2">Application</label>
        <input type="radio" name="tab-ae" id="tab-ae-3" />
        <label for="tab-ae-3">Code</label>
        <input type="radio" name="tab-ae" id="tab-ae-4" />
        <label for="tab-ae-4">Analysis</label>
        <div class="cdc-tl-tab-content">
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">A complete departure from the hash-based lineage. AE finds chunk boundaries by scanning for the <strong>maximum byte value</strong> within a sliding window of size <em>w</em>. A boundary is declared when the maximum is at the rightmost position of the window. It is called "asymmetric" because the check is one-sided: the max only needs to beat the preceding bytes, not the following ones. This naturally produces chunks whose sizes center around the window size. The approach eliminates all hash computation (no multiplication, no XOR, no table lookups), using only byte comparisons. The trade-off: a naive implementation rescans the entire window for each byte position, giving <em>O(w)</em> per byte, though a monotonic deque can reduce this to <em>O(1)</em> amortized.</div>
          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Reported 3&times; throughput over Rabin-based CDC while achieving comparable deduplication ratios on real-world datasets.<span class="cdc-cite"><a href="#ref-7">[7]</a></span> The simplicity of pure byte comparisons makes AE an important baseline in the local-extrema lineage, and its boundary decisions are inherently SIMD-parallelizable, a property later exploited by VectorCDC.<span class="cdc-cite"><a href="#ref-13">[13]</a></span></div>
          </div>
          <div class="cdc-tl-tab-panel">

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code"><pre><span class="c1">// AE: boundary when max byte is at window's right edge</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">min_size</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">uint8_t</span> <span class="n">max_val</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">max_pos</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">w</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;=</span> <span class="n">i</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// assumes min_size &gt;= w - 1</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="n">max_val</span><span class="p">)</span>
            <span class="p">{</span> <span class="n">max_val</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">];</span> <span class="n">max_pos</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span> <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">max_pos</span> <span class="o">==</span> <span class="n">i</span><span class="p">)</span> <span class="k">return</span> <span class="n">i</span><span class="p">;</span> <span class="c1">// boundary!</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc"><div><strong>Time:</strong> <em>O(w)</em> per byte naive, <em>O(1)</em> amortized with monotonic deque.</div><div><strong>Space:</strong> <em>O(w)</em> for the sliding window.</div></div>
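            <div class="cdc-tl-desc">The <em>O(1)</em> amortized bound can be illustrated with a minimal sketch of the monotonic-deque variant. It applies the same boundary rule as the naive loop (the byte at <em>i</em> is at least as large as every byte in its window), but each index is pushed and popped at most once. The function name <code>ae_boundary_deque</code>, its parameters, and the ring-buffer layout are assumptions made for illustration, not code from the AE paper.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper: returns the first position i >= min_size whose byte is
// >= every byte in the window [i - w + 1, i], or len if there is none.
// Assumes w >= 1 and min_size >= w - 1 so the first checked window is full.
size_t ae_boundary_deque(const uint8_t *data, size_t len, size_t w, size_t min_size) {
    assert(w >= 1 && min_size + 1 >= w);

    // Ring buffer of indices. The bytes they refer to are strictly
    // decreasing from head to tail, so the head is always the window max.
    size_t *dq = malloc(w * sizeof *dq);
    if (dq == NULL) return len;
    size_t head = 0, count = 0;
    size_t result = len;

    for (size_t i = min_size + 1 - w; i < len; i++) {
        // Drop the head if it has slid out of the window.
        if (count > 0 && dq[head] + w <= i) {
            head = (head + 1) % w;
            count--;
        }
        // Pop every index whose byte is <= the incoming byte (ties go to i).
        while (count > 0 && data[dq[(head + count - 1) % w]] <= data[i])
            count--;
        dq[(head + count) % w] = i;
        count++;

        // The max sits at the right edge exactly when i is the head.
        if (i >= min_size && dq[head] == i) {
            result = i;
            break;
        }
    }
    free(dq);
    return result;
}
```

            <div class="cdc-tl-desc">The deque never holds more than <em>w</em> indices, which matches the <em>O(w)</em> space figure above.</div>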
          </div>
        </div>
      </div>
    </div>
  </div>

  <div class="cdc-tl-entry">
    <div class="cdc-tl-dot bsw"></div>
    <div class="cdc-tl-card">
      <div class="cdc-tl-header">
        <div class="cdc-tl-name">FastCDC<span class="cdc-cite"><a href="#ref-5">[5]</a></span><span class="cdc-cite"><a href="#ref-6">[6]</a></span></div>
        <span class="cdc-tax-family-label bsw">BSW</span>
      </div>
      <div class="cdc-tl-year">2016</div>
      <div class="cdc-tl-tabs">
        <input type="radio" name="tab-fastcdc" id="tab-fastcdc-1" checked="" />
        <label for="tab-fastcdc-1">Overview</label>
        <input type="radio" name="tab-fastcdc" id="tab-fastcdc-2" />
        <label for="tab-fastcdc-2">Application</label>
        <input type="radio" name="tab-fastcdc" id="tab-fastcdc-3" />
        <label for="tab-fastcdc-3">Code</label>
        <input type="radio" name="tab-fastcdc" id="tab-fastcdc-4" />
        <label for="tab-fastcdc-4">Analysis</label>
        <div class="cdc-tl-tab-content">
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Builds directly on Gear (2014) and addresses a fundamental weakness of all single-threshold CDC: the exponential chunk-size distribution that produces many tiny chunks and occasional very large ones. FastCDC's key contribution is <strong>Normalized Chunking</strong>, a dual-mask strategy that uses a stricter mask (more bits must be zero) for positions below the expected average, and a looser mask (fewer bits) for positions above it. This "squeezes" the distribution toward a bell curve, dramatically improving deduplication by reducing both tiny chunks (which waste metadata) and huge chunks (which reduce sharing). The inner loop remains identical to Gear (one shift, one add, one mask check), so the dual-mask adds zero per-byte overhead. Combined with cut-point skipping (jumping past <code>min_size</code> bytes), FastCDC reported 10&times; throughput over Rabin-based CDC while matching or exceeding its deduplication ratio.</div>
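            <div class="cdc-tl-desc">As a rough sketch of how the two masks relate, the strict mask can be given <em>level</em> more bits than log<sub>2</sub>(average) and the loose mask <em>level</em> fewer, so a uniformly random hash passes them with probability 2<sup>-(bits+level)</sup> and 2<sup>-(bits-level)</sup> respectively. The names <code>nc_masks</code>, <code>make_nc_masks</code>, and <code>top_bits_mask</code> are hypothetical. Using the contiguous top bits is a simplifying assumption: the FastCDC paper spreads the mask bits out with zero padding. The top bits are chosen here because, with a shift-left Gear hash, the low bits depend on only the last few bytes.</div>

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical pair of masks for normalized chunking.
typedef struct {
    uint64_t strict; // used before the average size: more bits must be zero
    uint64_t loose;  // used after the average size: fewer bits must be zero
} nc_masks;

// Mask with the top `bits` bits set, for 0 < bits < 64.
static uint64_t top_bits_mask(unsigned bits) {
    assert(bits > 0 && bits < 64);
    return ~UINT64_C(0) << (64 - bits);
}

// avg must be a power of two; level is the normalization level.
nc_masks make_nc_masks(uint64_t avg, unsigned level) {
    assert(avg >= 2 && (avg & (avg - 1)) == 0);
    unsigned bits = 0;
    while ((UINT64_C(1) << bits) < avg)
        bits++; // bits = log2(avg)
    assert(level < bits && bits + level < 64);

    nc_masks m = { top_bits_mask(bits + level), top_bits_mask(bits - level) };
    return m;
}
```

            <div class="cdc-tl-desc">For an 8 KiB average and level 2, this yields a 15-bit strict mask and an 11-bit loose mask.</div>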
          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Reported 10&times; throughput over Rabin-based CDC and 3&times; over standalone Gear and AE, while matching or exceeding their deduplication ratios.<span class="cdc-cite"><a href="#ref-5">[5]</a></span> The 2020 enhancement (rolling two bytes per iteration) added another 30-40% throughput.<span class="cdc-cite"><a href="#ref-6">[6]</a></span> Adopted as the default chunker in open-source projects including <strong>Rdedup</strong>, and actively maintained as the <strong>fastcdc-rs</strong> Rust crate.</div>
          </div>
          <div class="cdc-tl-tab-panel">

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="code"><pre><span class="c1">// FastCDC: Gear hash + normalized chunking</span>
<span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">min</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">avg</span> <span class="o">&amp;&amp;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>              <span class="c1">// phase 1: strict mask</span>
    <span class="n">hash</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">gear_table</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">hash</span> <span class="o">&amp;</span> <span class="n">mask_s</span><span class="p">))</span> <span class="k">return</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">max</span> <span class="o">&amp;&amp;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>              <span class="c1">// phase 2: loose mask</span>
    <span class="n">hash</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">gear_table</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">hash</span> <span class="o">&amp;</span> <span class="n">mask_l</span><span class="p">))</span> <span class="k">return</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">i</span><span class="p">;</span>                                      <span class="c1">// hit max chunk size or end of data</span>
</pre></td></tr></tbody></table></code></pre></figure>

          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc"><div><strong>Time:</strong> <em>O(1)</em> per byte, identical to Gear. The dual-mask is a branch on position, not a per-byte cost.</div><div><strong>Space:</strong> <em>O(256)</em> for the Gear table.</div></div>
          </div>
        </div>
      </div>
    </div>
  </div>

  <div class="cdc-tl-entry">
    <div class="cdc-tl-dot extrema"></div>
    <div class="cdc-tl-card">
      <div class="cdc-tl-header">
        <div class="cdc-tl-name">RAM - Rapid Asymmetric Maximum<span class="cdc-cite"><a href="#ref-8">[8]</a></span></div>
        <span class="cdc-tax-family-label extrema">Extrema</span>
      </div>
      <div class="cdc-tl-year">2017</div>
      <div class="cdc-tl-tabs">
        <input type="radio" name="tab-ram" id="tab-ram-1" checked="" />
        <label for="tab-ram-1">Overview</label>
        <input type="radio" name="tab-ram" id="tab-ram-2" />
        <label for="tab-ram-2">Application</label>
        <input type="radio" name="tab-ram" id="tab-ram-3" />
        <label for="tab-ram-3">Code</label>
        <input type="radio" name="tab-ram" id="tab-ram-4" />
        <label for="tab-ram-4">Analysis</label>
        <div class="cdc-tl-tab-content">
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Refines AE's extremum approach with a critical performance optimization. RAM uses an <strong>asymmetric window</strong>: a small lookback (e.g., 256 bytes) and a larger lookahead (roughly the target chunk size). A boundary is declared when the current byte is the maximum of both windows combined. The key insight is the <strong>skip optimization</strong>: when a byte is <em>not</em> the maximum in the lookahead, the algorithm jumps directly to the position of the actual maximum, bypassing all intermediate positions. This keeps the average cost linear in the data: the number of bytes examined per boundary is roughly proportional to the chunk size, not to the chunk size times the window size. Like AE, RAM uses only byte comparisons with no arithmetic, making it attractive for resource-constrained environments.</div>
          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Targets cloud storage deduplication, where the skip optimization reduces the number of bytes examined per boundary.<span class="cdc-cite"><a href="#ref-8">[8]</a></span> When the current byte is not the lookahead maximum, the algorithm jumps directly to the actual maximum, so it avoids re-examining intermediate positions when boundaries are sparse. Like AE, RAM's boundary decisions are SIMD-parallelizable: VectorCDC's VRAM variant achieves 17&times; speedup using AVX-512.<span class="cdc-cite"><a href="#ref-13">[13]</a></span></div>
          </div>
          <div class="cdc-tl-tab-panel">

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre><span class="c1">// RAM: skip to the max, don't scan past it</span>
<span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">min</span><span class="p">;</span>                                   <span class="c1">// assumes min &gt;= back</span>
<span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">max_pos</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;=</span> <span class="n">i</span><span class="o">+</span><span class="n">ahead</span> <span class="o">&amp;&amp;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="c1">// scan lookahead, stay in bounds</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="n">data</span><span class="p">[</span><span class="n">max_pos</span><span class="p">])</span> <span class="n">max_pos</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">max_pos</span> <span class="o">!=</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="n">i</span> <span class="o">=</span> <span class="n">max_pos</span><span class="p">;</span> <span class="k">continue</span><span class="p">;</span> <span class="p">}</span>  <span class="c1">// skip!</span>
    <span class="n">bool</span> <span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">i</span><span class="o">-</span><span class="n">back</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">i</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span>           <span class="c1">// check lookback</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span> <span class="n">ok</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">ok</span><span class="p">)</span> <span class="k">return</span> <span class="n">i</span><span class="p">;</span>                             <span class="c1">// boundary!</span>
    <span class="n">i</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc"><div><strong>Time:</strong> <em>O(1)</em> amortized per byte due to skip optimization (worst case <em>O(w)</em>).</div><div><strong>Space:</strong> <em>O(w<sub>back</sub> + w<sub>ahead</sub>)</em> for the window buffers.</div></div>
          </div>
        </div>
      </div>
    </div>
  </div>

  <div class="cdc-tl-entry">
    <div class="cdc-tl-dot extrema"></div>
    <div class="cdc-tl-card">
      <div class="cdc-tl-header">
        <div class="cdc-tl-name">MII - Maximum of Interval-length Independent<span class="cdc-cite"><a href="#ref-9">[9]</a></span></div>
        <span class="cdc-tax-family-label extrema">Extrema</span>
      </div>
      <div class="cdc-tl-year">2019</div>
      <div class="cdc-tl-tabs">
        <input type="radio" name="tab-mii" id="tab-mii-1" checked="" />
        <label for="tab-mii-1">Overview</label>
        <input type="radio" name="tab-mii" id="tab-mii-2" />
        <label for="tab-mii-2">Application</label>
        <input type="radio" name="tab-mii" id="tab-mii-3" />
        <label for="tab-mii-3">Code</label>
        <input type="radio" name="tab-mii" id="tab-mii-4" />
        <label for="tab-mii-4">Analysis</label>
        <div class="cdc-tl-tab-content">
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Builds on AE and RAM but solves a practical problem: in AE/RAM, changing the target chunk size parameters changes which positions are boundary candidates, destroying deduplication against previously stored data. MII <strong>decouples</strong> the context window from the chunk size parameters. It uses a larger window <em>W</em> (often 2&times; the target) and identifies all positions that are the maximum of their <em>W</em>-neighborhood as boundary <em>candidates</em>. Separately, it filters these candidates to respect min/max chunk constraints. This "interval-length independent" property means the same byte positions will be candidates regardless of configuration, enabling stable deduplication across different chunk size settings and even multi-resolution deduplication.</div>
          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Targets <strong>incremental backup and data synchronization</strong>, where changing chunk-size parameters between backup generations should not invalidate prior deduplication.<span class="cdc-cite"><a href="#ref-9">[9]</a></span> In benchmarks, MII reduced incremental data by 13-34% compared to other algorithms at comparable throughput. The interval-length independence property also enables multi-resolution deduplication, where different storage tiers can use different chunk granularities while sharing boundary candidates.</div>
          </div>
          <div class="cdc-tl-tab-panel">

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="code"><pre><span class="c1">// MII: boundary candidates are independent of chunk size</span>
<span class="kt">size_t</span> <span class="n">last</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">bool</span> <span class="n">is_max</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">j</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;</span> <span class="n">W</span><span class="o">/</span><span class="mi">2</span> <span class="o">?</span> <span class="n">i</span> <span class="o">-</span> <span class="n">W</span><span class="o">/</span><span class="mi">2</span> <span class="o">:</span> <span class="mi">0</span><span class="p">);</span> <span class="n">j</span> <span class="o">&lt;=</span> <span class="n">i</span> <span class="o">+</span> <span class="n">W</span><span class="o">/</span><span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="c1">// large symmetric window, clamped to the buffer</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span> <span class="n">is_max</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">is_max</span><span class="p">)</span> <span class="k">continue</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="n">last</span> <span class="o">&gt;=</span> <span class="n">min</span><span class="p">)</span> <span class="p">{</span>                 <span class="c1">// filter by min chunk size</span>
        <span class="n">last</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
        <span class="n">emit_boundary</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>                  <span class="c1">// boundary!</span>
    <span class="p">}</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc"><div><strong>Time:</strong> <em>O(W)</em> per byte naive, <em>O(1)</em> amortized with monotonic deque.</div><div><strong>Space:</strong> <em>O(W)</em> for the context window, where <em>W &gt; w</em>.</div></div>
          </div>
        </div>
      </div>
    </div>
  </div>

  <div class="cdc-tl-marker">2020</div>

  <div class="cdc-tl-entry">
    <div class="cdc-tl-dot bsw"></div>
    <div class="cdc-tl-card">
      <div class="cdc-tl-header">
        <div class="cdc-tl-name">PCI - Popcount Independence<span class="cdc-cite"><a href="#ref-10">[10]</a></span></div>
        <span class="cdc-tax-family-label bsw">BSW</span>
      </div>
      <div class="cdc-tl-year">2020</div>
      <div class="cdc-tl-tabs">
        <input type="radio" name="tab-pci" id="tab-pci-1" checked="" />
        <label for="tab-pci-1">Overview</label>
        <input type="radio" name="tab-pci" id="tab-pci-2" />
        <label for="tab-pci-2">Application</label>
        <input type="radio" name="tab-pci" id="tab-pci-3" />
        <label for="tab-pci-3">Code</label>
        <input type="radio" name="tab-pci" id="tab-pci-4" />
        <label for="tab-pci-4">Analysis</label>
        <div class="cdc-tl-tab-content">
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Takes an unusual approach within the BSW family: instead of computing a hash, PCI counts the number of <strong>1-bits</strong> (Hamming weight) in a sliding window of raw bytes. A boundary is declared when the popcount reaches a threshold &theta;. Since the popcount of random bytes follows a binomial distribution, the threshold directly controls the average chunk size. What makes this surprisingly practical is hardware support: x86 CPUs have a dedicated <code>POPCNT</code> instruction and ARM offers the equivalent NEON <code>CNT</code>, so counting bits costs only a few cycles. No hash tables, no polynomial arithmetic, no random lookup tables. It is just counting bits in the raw data. The sliding window update is also simple: add the incoming byte's popcount, subtract the outgoing byte's.</div>
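            <div class="cdc-tl-desc">To make the link between the threshold and chunk size concrete, the sketch below computes the probability that a window of uniformly random bits has a popcount of at least &theta;, using the binomial distribution with <em>p</em> = 1/2. The function name <code>popcount_tail</code> is hypothetical. Treat the result as a rough guide only: real data is not uniformly random, and consecutive windows overlap, so the average gap between boundaries is not simply the reciprocal of this probability.</div>

```c
#include <assert.h>

// Hypothetical helper: P(X >= theta) for X ~ Binomial(n_bits, 1/2), where
// n_bits is the window size in bits (8 * w for a w-byte window).
double popcount_tail(int n_bits, int theta) {
    assert(n_bits >= 0 && n_bits <= 1000); // keeps 2^-n_bits representable

    double pmf = 1.0;
    for (int i = 0; i < n_bits; i++)
        pmf *= 0.5; // P(X = 0) = 2^-n_bits

    double tail = 0.0;
    for (int k = 0; k <= n_bits; k++) {
        if (k >= theta)
            tail += pmf;
        // Step from P(X = k) to P(X = k + 1).
        pmf = pmf * (double)(n_bits - k) / (double)(k + 1);
    }
    return tail;
}
```

            <div class="cdc-tl-desc">Raising &theta; shrinks the tail probability quickly, which is why small threshold changes move the average chunk size so much.</div>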
          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Designed for <strong>incremental data synchronization</strong> rather than storage deduplication.<span class="cdc-cite"><a href="#ref-10">[10]</a></span> PCI's popcount-based boundaries improve resistance to byte-shift propagation: in benchmarks, PCI improved Rsync calculation speed by up to 70% and reduced detected incremental data by 32-57% compared to other CDC algorithms. The trade-off is less uniform chunk-size distribution, making PCI better suited for sync workloads than dedup-ratio-sensitive storage.</div>
          </div>
          <div class="cdc-tl-tab-panel">

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="c1">// PCI: boundary when bit-population exceeds threshold</span>
<span class="kt">int</span> <span class="n">pop_sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">pop_sum</span> <span class="o">+=</span> <span class="n">__builtin_popcount</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>       <span class="c1">// add incoming</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="n">w</span><span class="p">)</span>
        <span class="n">pop_sum</span> <span class="o">-=</span> <span class="n">__builtin_popcount</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="n">w</span><span class="p">]);</span> <span class="c1">// remove outgoing</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="n">min</span> <span class="o">&amp;&amp;</span> <span class="n">pop_sum</span> <span class="o">&gt;=</span> <span class="n">threshold</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">i</span><span class="p">;</span> <span class="c1">// boundary!</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc"><div><strong>Time:</strong> <em>O(1)</em> per byte, consisting of one popcount each for the incoming and outgoing bytes, plus one addition and one subtraction.</div><div><strong>Space:</strong> <em>O(w)</em> for the sliding window buffer.</div></div>
          </div>
        </div>
      </div>
    </div>
  </div>

  <div class="cdc-tl-entry">
    <div class="cdc-tl-dot statistical"></div>
    <div class="cdc-tl-card">
      <div class="cdc-tl-header">
        <div class="cdc-tl-name">BFBC - Byte-Frequency Based Chunking<span class="cdc-cite"><a href="#ref-11">[11]</a></span></div>
        <span class="cdc-tax-family-label statistical">Statistical</span>
      </div>
      <div class="cdc-tl-year">2020</div>
      <div class="cdc-tl-tabs">
        <input type="radio" name="tab-bfbc" id="tab-bfbc-1" checked="" />
        <label for="tab-bfbc-1">Overview</label>
        <input type="radio" name="tab-bfbc" id="tab-bfbc-2" />
        <label for="tab-bfbc-2">Application</label>
        <input type="radio" name="tab-bfbc" id="tab-bfbc-3" />
        <label for="tab-bfbc-3">Code</label>
        <input type="radio" name="tab-bfbc" id="tab-bfbc-4" />
        <label for="tab-bfbc-4">Analysis</label>
        <div class="cdc-tl-tab-content">
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">A fundamentally different two-pass approach. In the first pass, BFBC scans the data and builds a frequency table of all <strong>byte pairs</strong> (digrams), then selects the top-<em>k</em> most common pairs. In the second pass, it scans linearly and declares a boundary whenever one of these high-frequency digrams appears (subject to min/max constraints). The insight: common digrams are inherently content-defined and recur consistently regardless of insertions or deletions elsewhere, serving as natural landmarks. Once the frequency table is built, the boundary detection pass is a simple table lookup per position. The fundamental trade-off: the pre-scan makes it <strong>unsuitable for streaming</strong>, and on high-entropy data (compressed files, encrypted content) the digram frequencies flatten out, destroying the algorithm's ability to find meaningful boundaries.</div>
          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Reported 10&times; faster chunking than Rabin-based BSW and 3&times; faster than TTTD (Two Thresholds Two Divisors).<span class="cdc-cite"><a href="#ref-11">[11]</a></span> Best suited for batch processing of known file types where the two-pass cost is acceptable. Works well when digram distributions are stable and distinctive, such as structured documents or source code.</div>
          </div>
          <div class="cdc-tl-tab-panel">

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre><span class="c1">// BFBC Phase 1: build digram frequency table</span>
<span class="kt">uint32_t</span> <span class="n">freq</span><span class="p">[</span><span class="mi">256</span><span class="p">][</span><span class="mi">256</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
    <span class="n">freq</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]][</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]]</span><span class="o">++</span><span class="p">;</span>
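<span class="n">bool</span> <span class="n">is_cut</span><span class="p">[</span><span class="mi">256</span><span class="p">][</span><span class="mi">256</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span> <span class="n">select_top_k</span><span class="p">(</span><span class="n">freq</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">is_cut</span><span class="p">);</span> <span class="c1">// fills is_cut (helper not shown)</span>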

<span class="c1">// Phase 2: chunk using frequent digrams as boundaries</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">min</span><span class="p">;</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">is_cut</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]][</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]])</span>
        <span class="k">return</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// boundary after digram</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc"><div><strong>Time:</strong> <em>O(n)</em> pre-scan + <em>O(1)</em> per byte for boundary detection.</div><div><strong>Space:</strong> <em>O(256&times;256)</em> for the digram frequency table.</div></div>
          </div>
        </div>
      </div>
    </div>
  </div>

  <div class="cdc-tl-marker">2025</div>

  <div class="cdc-tl-entry">
    <div class="cdc-tl-dot extrema"></div>
    <div class="cdc-tl-card">
      <div class="cdc-tl-header">
        <div class="cdc-tl-name">VectorCDC<span class="cdc-cite"><a href="#ref-13">[13]</a></span></div>
        <span class="cdc-tax-family-label extrema">Extrema</span>
      </div>
      <div class="cdc-tl-year">2025</div>
      <div class="cdc-tl-tabs">
        <input type="radio" name="tab-vectorcdc" id="tab-vectorcdc-1" checked="" />
        <label for="tab-vectorcdc-1">Overview</label>
        <input type="radio" name="tab-vectorcdc" id="tab-vectorcdc-2" />
        <label for="tab-vectorcdc-2">Application</label>
        <input type="radio" name="tab-vectorcdc" id="tab-vectorcdc-3" />
        <label for="tab-vectorcdc-3">Code</label>
        <input type="radio" name="tab-vectorcdc" id="tab-vectorcdc-4" />
        <label for="tab-vectorcdc-4">Analysis</label>
        <div class="cdc-tl-tab-content">
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">Demonstrates that Local Extrema algorithms are <strong>inherently SIMD-parallel</strong> in a way hash-based algorithms are not. Finding a local maximum across a window of bytes is a parallel comparison, exactly what SSE/AVX packed-max and packed-compare instructions are designed for. VectorCDC loads 16 bytes (SSE) or 32 bytes (AVX2) into a vector register and uses <code>_mm256_max_epu8</code> to compare all bytes simultaneously, extracting boundary candidates via <code>movemask</code>. Hash-based algorithms resist this because each hash update depends sequentially on the previous one. The VRAM variant (vectorized RAM) achieves <strong>16-42&times;</strong> throughput over scalar implementations, approaching memory bandwidth limits (~10-15 GB/s). Deduplication ratios remain identical since the boundary decisions are mathematically equivalent.</div>
          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc">VRAM (vectorized RAM) achieves <strong>6.5-30 GB/s</strong> with AVX-512, a 17&times; speedup over scalar RAM, approaching memory bandwidth limits.<span class="cdc-cite"><a href="#ref-13">[13]</a></span> Tested across 10 workloads spanning VM backups, database snapshots, source code repositories, and web archives. Because the boundary decisions are mathematically identical to the scalar algorithms, VectorCDC is a drop-in replacement that trades only wider SIMD hardware requirements for higher throughput.</div>
          </div>
          <div class="cdc-tl-tab-panel">

<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre><span class="c1">// VectorCDC: SIMD-accelerated local max (AVX2); caller starts with i &gt;= 1</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">32</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="mi">32</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// strict &lt;: next reads one byte ahead</span>
    <span class="n">__m256i</span> <span class="n">curr</span> <span class="o">=</span> <span class="n">_mm256_loadu_si256</span><span class="p">((</span><span class="k">const</span> <span class="n">__m256i</span> <span class="o">*</span><span class="p">)(</span><span class="n">data</span> <span class="o">+</span> <span class="n">i</span><span class="p">));</span>
    <span class="n">__m256i</span> <span class="n">prev</span> <span class="o">=</span> <span class="n">_mm256_loadu_si256</span><span class="p">((</span><span class="k">const</span> <span class="n">__m256i</span> <span class="o">*</span><span class="p">)(</span><span class="n">data</span> <span class="o">+</span> <span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">));</span>
    <span class="n">__m256i</span> <span class="n">next</span> <span class="o">=</span> <span class="n">_mm256_loadu_si256</span><span class="p">((</span><span class="k">const</span> <span class="n">__m256i</span> <span class="o">*</span><span class="p">)(</span><span class="n">data</span> <span class="o">+</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">));</span>
    <span class="n">__m256i</span> <span class="n">is_max</span> <span class="o">=</span> <span class="n">_mm256_and_si256</span><span class="p">(</span>
        <span class="n">_mm256_cmpeq_epi8</span><span class="p">(</span><span class="n">curr</span><span class="p">,</span> <span class="n">_mm256_max_epu8</span><span class="p">(</span><span class="n">curr</span><span class="p">,</span> <span class="n">prev</span><span class="p">)),</span>
        <span class="n">_mm256_cmpeq_epi8</span><span class="p">(</span><span class="n">curr</span><span class="p">,</span> <span class="n">_mm256_max_epu8</span><span class="p">(</span><span class="n">curr</span><span class="p">,</span> <span class="n">next</span><span class="p">)));</span>
    <span class="kt">uint32_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="n">_mm256_movemask_epi8</span><span class="p">(</span><span class="n">is_max</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">mask</span><span class="p">)</span> <span class="k">return</span> <span class="n">i</span> <span class="o">+</span> <span class="n">__builtin_ctz</span><span class="p">(</span><span class="n">mask</span><span class="p">);</span> <span class="c1">// first local max</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

          </div>
          <div class="cdc-tl-tab-panel">
            <div class="cdc-tl-desc"><div><strong>Time:</strong> <em>O(n)</em> overall, but with AVX2 each loop iteration examines 32 bytes, so the scan needs about 1/32 as many iterations as a scalar loop and approaches memory bandwidth.</div><div><strong>Space:</strong> <em>O(1)</em> beyond the data, requiring only a few vector registers with no tables or buffers.</div></div>
          </div>
        </div>
      </div>
    </div>
  </div>

</div>

<p>In the next post, <a href="/writings/content-defined-chunking-part-2">Part 2: A Deep Dive into FastCDC</a>, we’ll take a closer look at the BSW family through FastCDC, an algorithm that combines Gear hashing with normalized chunking to achieve both high throughput and excellent deduplication.</p>

<hr />

<h3 id="references">References</h3>

<div class="cdc-references">

<div class="bib-entry" id="ref-1">
  <div class="bib-number">[1]</div>
  <div class="bib-citation">M. O. Rabin, "Fingerprinting by Random Polynomials," Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/rabin-1981-fingerprinting.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://scholar.google.com/scholar?q=rabin+fingerprinting+random+polynomials+1981" class="bib-link external"><i class="fa-solid fa-magnifying-glass"></i> Google Scholar</a>
  </div>
</div>

<div class="bib-entry" id="ref-2">
  <div class="bib-number">[2]</div>
  <div class="bib-citation">A. Muthitacharoen, B. Chen &amp; D. Mazi&egrave;res, "A Low-Bandwidth Network File System," <em>Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP)</em>, 2001.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/lbfs-2001-muthitacharoen.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://dl.acm.org/doi/10.1145/502034.502052" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> ACM DL</a>
  </div>
</div>

<div class="bib-entry" id="ref-3">
  <div class="bib-number">[3]</div>
  <div class="bib-citation">J. D. Cohen, "Recursive Hashing Functions for N-Grams," <em>ACM Transactions on Information Systems</em>, vol. 15, no. 3, pp. 291-320, 1997.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/buzhash-1997-cohen.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://dl.acm.org/doi/10.1145/256163.256168" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> ACM DL</a>
  </div>
</div>

<div class="bib-entry" id="ref-4">
  <div class="bib-number">[4]</div>
  <div class="bib-citation">W. Xia, H. Jiang, D. Feng &amp; L. Tian, "Ddelta: A Deduplication-Inspired Fast Delta Compression Approach," <em>Performance Evaluation</em>, vol. 79, pp. 258-272, 2014.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/ddelta-2014-xia.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://www.sciencedirect.com/science/article/abs/pii/S0166531614000790" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> ScienceDirect</a>
  </div>
</div>

<div class="bib-entry" id="ref-5">
  <div class="bib-number">[5]</div>
  <div class="bib-citation">W. Xia, H. Jiang, D. Feng, L. Tian, M. Fu &amp; Y. Zhou, "FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication," <em>Proceedings of the USENIX Annual Technical Conference (ATC)</em>, 2016.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/fastcdc-2016-xia.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

<div class="bib-entry" id="ref-6">
  <div class="bib-number">[6]</div>
  <div class="bib-citation">W. Xia, Y. Zhou, H. Jiang, D. Feng, Y. Hua, Y. Hu, Q. Liu &amp; Y. Zhang, "The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems," <em>IEEE Transactions on Parallel and Distributed Systems</em>, vol. 31, no. 9, pp. 2017-2031, 2020.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/fastcdc-2020-xia.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://ieeexplore.ieee.org/document/9055082" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE Xplore</a>
  </div>
</div>

<div class="bib-entry" id="ref-7">
  <div class="bib-number">[7]</div>
  <div class="bib-citation">Y. Zhang, H. Jiang, D. Feng, W. Xia, M. Fu, F. Huang &amp; Y. Zhou, "AE: An Asymmetric Extremum Content Defined Chunking Algorithm for Fast and Bandwidth-Efficient Data Deduplication," <em>Proceedings of IEEE INFOCOM</em>, 2015.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/ae-2015-zhang.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://ieeexplore.ieee.org/document/7218510" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE Xplore</a>
  </div>
</div>

<div class="bib-entry" id="ref-8">
  <div class="bib-number">[8]</div>
  <div class="bib-citation">R. N. Widodo, H. Lim &amp; M. Atiquzzaman, "A New Content-Defined Chunking Algorithm for Data Deduplication in Cloud Storage," <em>Future Generation Computer Systems</em>, vol. 71, pp. 145-156, 2017.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/ram-2017-widodo.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://www.sciencedirect.com/science/article/abs/pii/S0167739X16305829" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> ScienceDirect</a>
  </div>
</div>

<div class="bib-entry" id="ref-9">
  <div class="bib-number">[9]</div>
  <div class="bib-citation">C. Zhang, D. Qi, W. Li &amp; J. Guo, "MII: A Novel Content Defined Chunking Algorithm for Finding Incremental Data in Data Synchronization," <em>IEEE Access</em>, vol. 7, pp. 86862-86875, 2019.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/mii-2019-zhang.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://ieeexplore.ieee.org/document/8752387" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE Xplore</a>
  </div>
</div>

<div class="bib-entry" id="ref-10">
  <div class="bib-number">[10]</div>
  <div class="bib-citation">C. Zhang, D. Qi, W. Li &amp; J. Guo, "Function of Content Defined Chunking Algorithms in Incremental Synchronization," <em>IEEE Access</em>, vol. 8, pp. 5316-5330, 2020.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/pci-2020-zhang.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://ieeexplore.ieee.org/document/8949536" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> IEEE Xplore</a>
  </div>
</div>

<div class="bib-entry" id="ref-11">
  <div class="bib-number">[11]</div>
  <div class="bib-citation">A. S. M. Saeed &amp; L. E. George, "Data Deduplication System Based on Content-Defined Chunking Using Bytes Pair Frequency Occurrence," <em>Symmetry</em>, vol. 12, no. 11, article 1841, 2020.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/bfbc-2020-saeed.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://www.mdpi.com/2073-8994/12/11/1841" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> MDPI</a>
  </div>
</div>

<div class="bib-entry" id="ref-12">
  <div class="bib-number">[12]</div>
  <div class="bib-citation">M. Gregoriadis, L. Balduf, B. Scheuermann &amp; J. Pouwelse, "A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication," <em>arXiv:2409.06066</em>, 2024.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/gregoriadis-2024-survey.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://arxiv.org/abs/2409.06066" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> arXiv</a>
  </div>
</div>

<div class="bib-entry" id="ref-13">
  <div class="bib-number">[13]</div>
  <div class="bib-citation">S. Udayashankar, A. Baba &amp; A. Al-Kiswany, "VectorCDC: Accelerating Data Deduplication with Vector Instructions," <em>Proceedings of the 23rd USENIX Conference on File and Storage Technologies (FAST)</em>, 2025.</div>
  <div class="bib-links">
    <a href="/assets/papers/cdc/vectorcdc-2025-udayashankar.pdf" class="bib-link pdf"><i class="fa-solid fa-file-pdf"></i> PDF</a>
    <a href="https://www.usenix.org/conference/fast25/presentation/udayashankar" class="bib-link external"><i class="fa-solid fa-arrow-up-right-from-square"></i> USENIX</a>
  </div>
</div>

</div>

<hr />

<div class="cdc-series-nav">
Continue reading &rarr; <a href="/writings/content-defined-chunking-part-2">Part 2: A Deep Dive into FastCDC</a>
</div>

<script type="module" src="/assets/js/cdc-animations.js"></script>

]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Anatomy of a Line Field Animation</title>
      <link>https://rickwinfrey.com/writings/anatomy-of-a-line-field-animation</link>
      <guid isPermaLink="true">https://rickwinfrey.com/writings/anatomy-of-a-line-field-animation</guid>
      <pubDate>Thu, 15 Jan 2026 12:00:00 +0000</pubDate>
      
      <description>How a simple grid of lines evolved into an organic, wind-driven canvas animation through iterative prompting.</description>
      
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <![CDATA[<style>
/* Demo container - no border, auto height */
.demo-box {
  margin: 2rem 0;
}

.demo-box canvas {
  display: block;
  width: 100%;
  height: 500px;
  border-radius: 8px;
  background: #faf9f7;
}

/* Controls panel matching animations page */
.demo-controls {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(130px, 1fr));
  gap: 1.5rem;
  padding: 1.5rem;
  background: #fff;
  border-radius: 8px;
  border: 1px solid rgba(61, 58, 54, 0.1);
  margin-top: 1rem;
}

.demo-controls-featured {
  grid-column: 1 / -1;
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(130px, 1fr));
  gap: 1.5rem;
  padding-bottom: 1rem;
  border-bottom: 1px solid rgba(61, 58, 54, 0.1);
  margin-bottom: 0.5rem;
}

.demo-controls-row {
  grid-column: 1 / -1;
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(130px, 1fr));
  gap: 1.5rem;
}

.demo-controls-row + .demo-controls-row {
  margin-top: 1rem;
  padding-top: 1rem;
  border-top: 1px solid rgba(61, 58, 54, 0.06);
}

.control-group {
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
}

.control-label {
  font-family: 'Libre Baskerville', Georgia, serif;
  font-size: 0.9rem;
  color: #3d3a36;
}

.slider-container {
  display: flex;
  align-items: center;
}

.demo-controls input[type="range"] {
  width: 120px;
  height: 6px;
  -webkit-appearance: none;
  appearance: none;
  background: linear-gradient(to right, #d4a574, #c45a3b);
  border-radius: 3px;
  outline: none;
}

.demo-controls input[type="range"]::-webkit-slider-thumb {
  -webkit-appearance: none;
  appearance: none;
  width: 16px;
  height: 16px;
  background: #c45a3b;
  border-radius: 50%;
  cursor: pointer;
  box-shadow: 0 2px 4px rgba(0,0,0,0.2);
  transition: transform 0.15s ease;
}

.demo-controls input[type="range"]::-webkit-slider-thumb:hover {
  transform: scale(1.1);
}

.demo-controls input[type="range"]::-moz-range-thumb {
  width: 16px;
  height: 16px;
  background: #c45a3b;
  border-radius: 50%;
  cursor: pointer;
  border: none;
  box-shadow: 0 2px 4px rgba(0,0,0,0.2);
}

.demo-controls input[type="color"] {
  width: 40px;
  height: 28px;
  border: none;
  border-radius: 4px;
  cursor: pointer;
  padding: 0;
  background: none;
}

.demo-controls input[type="color"]::-webkit-color-swatch-wrapper {
  padding: 0;
}

.demo-controls input[type="color"]::-webkit-color-swatch {
  border: 2px solid rgba(61, 58, 54, 0.2);
  border-radius: 4px;
}

/* Compact compass for demos */
.compass-small {
  width: 80px;
  height: 80px;
  border-radius: 50%;
  background: #faf9f7;
  border: 2px solid rgba(61, 58, 54, 0.2);
  position: relative;
  cursor: pointer;
  margin: 0 auto;
}

.compass-small .compass-direction {
  position: absolute;
  font-size: 0.6rem;
  font-weight: 600;
  color: #8b7355;
}

.compass-small .compass-n { top: 4px; left: 50%; transform: translateX(-50%); }
.compass-small .compass-s { bottom: 4px; left: 50%; transform: translateX(-50%); }
.compass-small .compass-e { right: 4px; top: 50%; transform: translateY(-50%); }
.compass-small .compass-w { left: 4px; top: 50%; transform: translateY(-50%); }

.compass-small .compass-needle {
  position: absolute;
  top: 50%;
  left: 50%;
  width: 3px;
  height: 30px;
  background: linear-gradient(to top, #c45a3b 50%, #3d3a36 50%);
  transform-origin: center bottom;
  transform: translateX(-50%) translateY(-100%);
  border-radius: 2px;
  pointer-events: none;
}

.compass-small .compass-center {
  position: absolute;
  top: 50%;
  left: 50%;
  width: 8px;
  height: 8px;
  background: #c45a3b;
  border-radius: 50%;
  transform: translate(-50%, -50%);
  pointer-events: none;
  box-shadow: 0 1px 3px rgba(0,0,0,0.2);
}

@media (max-width: 42em) {
  .demo-controls {
    grid-template-columns: 1fr;
  }
  .demo-controls-featured {
    grid-template-columns: 1fr;
  }
}

/* Prompt callout styling */
.prompt {
  position: relative;
  margin: 1.5rem 0;
  padding: 1.25rem 1.5rem 1.25rem 1.25rem;
  background: linear-gradient(135deg, rgba(196, 90, 59, 0.06) 0%, rgba(212, 165, 116, 0.08) 100%);
  border-left: 3px solid #c45a3b;
  border-radius: 0 6px 6px 0;
  font-style: italic;
  color: #3d3a36;
  line-height: 1.6;
}

.prompt::before {
  content: "Prompt";
  position: absolute;
  top: -0.6rem;
  left: 1rem;
  font-size: 0.7rem;
  font-weight: 600;
  font-style: normal;
  text-transform: uppercase;
  letter-spacing: 0.08em;
  color: #c45a3b;
  background: #faf9f7;
  padding: 0 0.4rem;
}
</style>

<p>What started as a fun, low-stakes idea, adding a subtle wind-movement animation to the background of my homepage, turned into an exploration of JavaScript canvas rendering, noise functions, planar waves, and the surprisingly rich parameter space of moving lines.</p>

<p>This post walks through the mental model behind the line field animation used as the background for this website, how it evolved from a static grid into an organic wind simulation, and what I learned about creative coding through conversational iteration with LLMs.</p>

<hr />

<h2 id="the-mental-model">The Mental Model</h2>

<p>At its core, the animation is simple: draw a field of parallel lines on a canvas, then displace each point along those lines based on mathematical functions that change over time to simulate wind.</p>

<p>The key insight is that <strong>lines are just sequences of points</strong>. If you can control where each point sits, you can make the line wave, ripple, or flow.</p>

<h3 id="how-wind-moves">How Wind “Moves”</h3>

<p>Wind isn’t modeled as emanating from a point source. Instead, it’s a <strong>traveling planar wave</strong>. A useful reference is ocean waves approaching a beach rather than ripples from a dropped stone. Each point’s position is projected onto the wind direction axis, and a sine wave sweeps along that axis over time:</p>

<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre>  <span class="nx">displacement</span> <span class="o">=</span> <span class="nx">sin</span><span class="p">(</span><span class="nx">position_along_wind_axis</span> <span class="o">+</span> <span class="nx">time</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>As time advances, the sine wave pattern shifts in the wind direction, creating the illusion of wind sweeping across the field. With two wind sources at different angles, the waves interfere to create complex, naturalistic patterns.</p>
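
<p>To make the projection concrete, here is a minimal sketch of a planar wave. It is not the code in <code>line-field.js</code>; <code>planarWave</code>, <code>twoWinds</code>, and the parameter names are hypothetical, and the sign on the time term is an assumption chosen so that crests travel in the wind direction.</p>

```javascript
// Hypothetical helper (not from line-field.js): displacement of a point
// under one planar wind wave. angle is the wind direction in radians;
// wavelength is in pixels and speed in pixels per second.
function planarWave(x, y, time, { angle, wavelength, speed, amplitude }) {
  // Project the point onto the wind axis (a unit vector).
  const along = x * Math.cos(angle) + y * Math.sin(angle);
  // The phase grows with position along the axis and shrinks with time,
  // so the crests travel in the wind direction.
  const phase = (2 * Math.PI * (along - speed * time)) / wavelength;
  return amplitude * Math.sin(phase);
}

// Two wind sources at different angles simply add, which is where the
// interference patterns come from.
function twoWinds(x, y, time, windA, windB) {
  return planarWave(x, y, time, windA) + planarWave(x, y, time, windB);
}

const east = { angle: 0, wavelength: 200, speed: 50, amplitude: 10 };
const south = { angle: Math.PI / 2, wavelength: 120, speed: 30, amplitude: 6 };

// Points on the same wavefront (same x for an eastward wind) move together.
console.log(planarWave(50, 0, 0, east), planarWave(50, 300, 0, east));
console.log(twoWinds(50, 80, 1.5, east, south));
```

<p>Because only the projection onto the wind axis enters the phase, every point on a line perpendicular to the wind gets the same displacement, which is what makes the wavefronts look planar rather than circular.</p>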

<h3 id="layered-displacement">Layered Displacement</h3>

<p>The final displacement of each point combines several layers:</p>

<ol>
  <li><strong>Simplex noise</strong> — Organic, spatially-coherent randomness that scrolls in the wind direction</li>
  <li><strong>Traveling sine waves</strong> — Parallel wavefronts for each wind source</li>
  <li><strong>Per-line modifiers</strong> — Jitter (phase randomization), whisp (amplitude variation), and gust envelopes</li>
</ol>

<p>Each layer operates at a different spatial frequency and serves a different purpose, but all of them scroll or animate over time to create cohesive movement.</p>
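
<p>One way the three layers might be combined is sketched below. This is an illustration under assumptions rather than the actual implementation: <code>smoothNoise</code> is a cheap stand-in for simplex noise, and the parameter names (<code>noiseMix</code>, <code>gustRate</code>, and so on) and the way the layers are blended are invented for the example.</p>

```javascript
// Stand-in for simplex noise: a cheap smooth function with values in [-1, 1].
// A real implementation would call a simplex noise routine here instead.
function smoothNoise(x, y) {
  return (Math.sin(x * 1.7 + Math.sin(y * 2.3)) + Math.sin(y * 1.3 + Math.sin(x * 0.9))) / 2;
}

// Deterministic per-line pseudo-random value in [0, 1), so a line keeps
// the same modifiers from frame to frame.
function lineRandom(lineIndex) {
  const s = Math.sin(lineIndex * 12.9898) * 43758.5453;
  return s - Math.floor(s);
}

// Hypothetical combination of the three layers for one point on one line.
function layeredDisplacement(x, y, lineIndex, time, p) {
  const dirX = Math.cos(p.windAngle);
  const dirY = Math.sin(p.windAngle);
  const drift = p.windSpeed * time;

  // Layer 1: noise sampled from a field that scrolls in the wind direction.
  const noise = smoothNoise((x - dirX * drift) * p.noiseScale, (y - dirY * drift) * p.noiseScale);

  // Layer 2: traveling sine wave, with jitter as a per-line phase offset.
  const along = x * dirX + y * dirY;
  const phaseOffset = p.jitter * lineRandom(lineIndex) * 2 * Math.PI;
  const wave = Math.sin((2 * Math.PI * (along - drift)) / p.wavelength + phaseOffset);

  // Layer 3: per-line amplitude variation (whisp) and a slow gust
  // envelope that stays in [0, 1].
  const lineAmplitude = 1 - p.whisp * lineRandom(lineIndex + 1000);
  const gust = 0.5 + 0.5 * Math.sin(time * p.gustRate);

  return p.amplitude * lineAmplitude * gust * (p.noiseMix * noise + (1 - p.noiseMix) * wave);
}

const params = {
  windAngle: Math.PI / 6, windSpeed: 40, wavelength: 180,
  noiseScale: 0.01, noiseMix: 0.4,
  jitter: 0.5, whisp: 0.3, gustRate: 0.8, amplitude: 12,
};
console.log(layeredDisplacement(120, 80, 3, 2.0, params));
```

<p>With <code>jitter</code> and <code>whisp</code> set to zero, every line reacts identically to the wind; raising them is what breaks the field out of lockstep.</p>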

<hr />

<h2 id="the-evolution-from-static-to-organic">The Evolution: From Static to Organic</h2>

<p>The animation started simple: just a field of lines. Each turn in the session added parameters and controls, and with them complexity. Below is a timeline of how the animation evolved, along with the prompt used for each step.</p>

<h3 id="stage-1-a-grid-of-lines">Stage 1: A Grid of Lines</h3>

<p>The first version was trivially simple: parallel lines at a fixed angle, evenly spaced.</p>

<div class="prompt">
"Add a field of lines parameterized by line width, line spacing (the distance between lines), and line direction. Also add controls for line color and opacity. Let's add this to /animations in isolation."
</div>

<p>The LLM (Opus 4.5) produced the initial skeleton: a canvas element, a render loop, and controls for the requested parameters.</p>

<p><strong>Controls:</strong></p>
<ul>
  <li><strong>Line Direction</strong> — The angle at which lines are drawn across the canvas</li>
  <li><strong>Line Spacing</strong> — Distance between parallel lines (smaller = denser field)</li>
  <li><strong>Line Width</strong> — Stroke thickness of each line</li>
  <li><strong>Opacity</strong> — Transparency of the lines</li>
  <li><strong>Line Color</strong> — The color of the lines</li>
</ul>
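
<p>The geometry behind such a grid can be sketched in a few lines. The function below is a hypothetical helper, not the skeleton the LLM produced: it assumes the lines are centered on the canvas and made long enough (half the diagonal in each direction) to cover it at any angle.</p>

```javascript
// Sketch only: the name and signature are illustrative, not the site's API.
// Returns endpoints for evenly spaced parallel lines covering a
// width x height canvas. orientation is the line angle in radians.
function parallelLines(width, height, orientation, spacing) {
  // Unit vector along the lines, and the unit normal between neighbors.
  const dirX = Math.cos(orientation);
  const dirY = Math.sin(orientation);
  const normX = -dirY;
  const normY = dirX;

  // Half the canvas diagonal: long enough to cover it at any angle.
  const reach = Math.hypot(width, height) / 2;
  const count = Math.ceil(reach / spacing);
  const centerX = width / 2;
  const centerY = height / 2;

  return Array.from({ length: 2 * count + 1 }, (_, i) => {
    // Step away from the center line along the normal.
    const offset = (i - count) * spacing;
    const originX = centerX + normX * offset;
    const originY = centerY + normY * offset;
    return {
      x0: originX - dirX * reach,
      y0: originY - dirY * reach,
      x1: originX + dirX * reach,
      y1: originY + dirY * reach,
    };
  });
}

// Vertical lines (orientation = PI / 2), 7px apart, on a 400x300 canvas.
const lines = parallelLines(400, 300, Math.PI / 2, 7);
console.log(lines.length, lines[0]);
```

<p>Each returned segment can then be stroked with the Canvas 2D <code>moveTo</code> and <code>lineTo</code> calls; anything that overshoots the canvas edge is clipped automatically.</p>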

<div class="demo-box">
  <div class="demo-canvas-wrapper">
    <canvas id="demo1"></canvas>
    <div class="demo-paused-overlay paused" id="demo1-overlay">
      <button class="play-btn" aria-label="Play">
        <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
      </button>
      <span class="paused-message">Animation paused on mobile. Tap play to start.</span>
    </div>
  </div>
  <div class="demo-controls">
    <div class="control-group">
      <label class="control-label">Line Direction</label>
      <div class="compass-small" id="demo1-orient-compass">
        <span class="compass-direction compass-n">N</span>
        <span class="compass-direction compass-s">S</span>
        <span class="compass-direction compass-e">E</span>
        <span class="compass-direction compass-w">W</span>
        <div class="compass-needle" id="demo1-orient-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
        <div class="compass-center"></div>
      </div>
    </div>
    <div class="control-group">
      <label class="control-label">Line Spacing</label>
      <div class="slider-container">
        <input type="range" id="demo1-spacing" min="4" max="20" value="7" />
      </div>
    </div>
    <div class="control-group">
      <label class="control-label">Line Width</label>
      <div class="slider-container">
        <input type="range" id="demo1-width" min="0.3" max="2" value="0.7" step="0.1" />
      </div>
    </div>
    <div class="control-group">
      <label class="control-label">Opacity</label>
      <div class="slider-container">
        <input type="range" id="demo1-opacity" min="0" max="1" value="0.9" step="0.1" />
      </div>
    </div>
    <div class="control-group">
      <label class="control-label">Line Color</label>
      <input type="color" id="demo1-color" value="#b4afa5" />
    </div>
  </div>
</div>

<script type="module">
import { LineFieldAnimation } from '/assets/js/line-field.js';

const canvas = document.getElementById('demo1');
const animation = new LineFieldAnimation(canvas, {
  mode: 'container',
  params: {
    spacing: 7,
    width: 0.7,
    wind: 0,
    wind2: 0,
    jitter: 0,
    jitterDiameter: 0,
    opacityDepthFactor: 0,
    lineOrientation: Math.PI / 2,
    color: '#b4afa5'
  }
});

// Mobile overlay handling
const isMobile = window.matchMedia('(max-width: 42em)').matches;
const overlay = document.getElementById('demo1-overlay');
const playBtn = overlay.querySelector('.play-btn');
let hasRendered = false;

function render() {
  requestAnimationFrame(() => {
    animation.resize();
    animation.renderOnce();
  });
}

// Re-render on resize since Stage 1 is static (no animation loop)
window.addEventListener('resize', () => {
  if (hasRendered) {
    render();
  }
});

if (isMobile) {
  // Mobile: wait for user to tap play
  playBtn.addEventListener('click', () => {
    overlay.classList.remove('paused');
    hasRendered = true;
    render();
  });
} else {
  // Desktop: render immediately and hide overlay
  overlay.classList.remove('paused');
  hasRendered = true;
  if (document.readyState === 'complete') {
    render();
  } else {
    window.addEventListener('load', render);
  }
}

// Compass interaction
const orientCompass = document.getElementById('demo1-orient-compass');
const orientNeedle = document.getElementById('demo1-orient-needle');
let dragging = false;

function updateOrient(e) {
  const rect = orientCompass.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ lineOrientation: angle });
  orientNeedle.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
  animation.renderOnce();
}

orientCompass.addEventListener('mousedown', (e) => { dragging = true; updateOrient(e); });
document.addEventListener('mousemove', (e) => { if (dragging) updateOrient(e); });
document.addEventListener('mouseup', () => { dragging = false; });
orientCompass.addEventListener('touchstart', (e) => { dragging = true; updateOrient(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (dragging) updateOrient(e.touches[0]); });
document.addEventListener('touchend', () => { dragging = false; });

document.getElementById('demo1-spacing').addEventListener('input', (e) => {
  animation.setParams({ spacing: parseInt(e.target.value, 10) });
  animation.renderOnce();
});

document.getElementById('demo1-width').addEventListener('input', (e) => {
  animation.setParams({ width: parseFloat(e.target.value) });
  animation.renderOnce();
});

document.getElementById('demo1-opacity').addEventListener('input', (e) => {
  animation.setParams({ opacity: parseFloat(e.target.value) });
  animation.renderOnce();
});

document.getElementById('demo1-color').addEventListener('input', (e) => {
  animation.setParams({ color: e.target.value });
  animation.renderOnce();
});
</script>

<script type="module">
// Shared animation controller for mobile/desktop behavior
window.setupAnimatedDemo = function(animation, canvasId, overlayId, pauseId) {
  const isMobile = window.matchMedia('(max-width: 42em)').matches;
  const canvas = document.getElementById(canvasId);
  const overlay = document.getElementById(overlayId);
  const pauseBtn = document.getElementById(pauseId);
  const playBtn = overlay.querySelector('.play-btn');
  let isPlaying = false;
  let isVisible = false;
  let userStarted = false;

  function hideOverlay() {
    overlay.classList.remove('paused');
  }

  function showOverlay() {
    overlay.classList.add('paused');
  }

  function showPauseBtn() {
    pauseBtn.classList.add('playing');
  }

  function hidePauseBtn() {
    pauseBtn.classList.remove('playing');
  }

  function play() {
    if (!isPlaying) {
      isPlaying = true;
      animation.start();
      hideOverlay();
      if (isMobile) showPauseBtn();
    }
  }

  function pause() {
    if (isPlaying) {
      isPlaying = false;
      animation.stop();
      animation.renderOnce();
      hidePauseBtn();
      // Show overlay on mobile when paused by user
      if (isMobile) {
        showOverlay();
      }
    }
  }

  // Intersection Observer - pause when not visible
  const observer = new IntersectionObserver((entries) => {
    entries.forEach(entry => {
      isVisible = entry.isIntersecting;
      if (isVisible) {
        // On mobile: only play if user has explicitly started it
        // On desktop: auto-play when visible
        if (!isMobile || userStarted) {
          play();
        }
      } else {
        // Pause but don't show overlay when scrolling away
        if (isPlaying) {
          isPlaying = false;
          animation.stop();
          animation.renderOnce();
          hidePauseBtn();
        }
      }
    });
  }, { threshold: 0.1 });

  observer.observe(canvas);

  // Play button handler (mobile overlay)
  playBtn.addEventListener('click', () => {
    userStarted = true;
    play();
  });

  // Pause button handler (mobile corner button)
  pauseBtn.addEventListener('click', () => {
    userStarted = false;
    pause();
  });

  // Initial state
  if (isMobile) {
    // Mobile: start paused with overlay, render once for preview
    animation.renderOnce();
  } else {
    // Desktop: hide overlay, will auto-play when visible via Intersection Observer
    hideOverlay();
    animation.renderOnce();
  }

  return { play, pause, isPlaying: () => isPlaying };
};
</script>

<h3 id="stage-2-wind-and-movement">Stage 2: Wind and Movement</h3>

<div class="prompt">
"Let's now add movement to the field of lines. I'd like to simulate wind blowing over the lines, parameterized by wind speed and wind direction. I'd also like wind size to represent the overall size of a single wind element's influence over a line. The larger the wind size, the more it displaces the length of the line. Let's also include jitter and jitter diameter per line. The jitter influences the randomness by which a line reacts to wind, and the jitter diameter controls the distance a line can move from its origin."
</div>

<p>Wind is modeled as a <strong>sine wave traveling in a direction</strong>. The wave displaces the points that make up each line, and points along the wind axis move together, creating the illusion of cohesive wind pushing through the field. <strong>Simplex noise</strong> adds organic variation: each point samples a 2D noise field and shifts perpendicular to the line direction.</p>

<p><strong>Controls:</strong></p>
<ul>
  <li><strong>Wind Speed</strong> — How fast the wave travels through the field</li>
  <li><strong>Wind Direction</strong> — The direction the wave propagates</li>
  <li><strong>Wind Size</strong> — Wavelength of the wind pattern (larger = broader, gentler curves)</li>
  <li><strong>Jitter Amount</strong> — Randomizes each line’s wave phase (0 = lines move in sync, higher = lines move independently/chaotically)</li>
  <li><strong>Jitter Diameter</strong> — Overall displacement magnitude (larger = points move further from origin)</li>
</ul>
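
<p>The traveling-wave idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in <code>line-field.js</code>: the function names, the <code>wavelength</code>, <code>amplitude</code>, and <code>phase</code> fields, and the choice to apply the wave offset perpendicular to the line are all hypothetical. Simplex noise is left out to keep the example self-contained.</p>

```javascript
// Illustrative sketch only: function and parameter names are assumptions,
// not the actual internals of line-field.js.

// Offset contributed by a sine wave traveling along `direction` (radians).
function windOffset(x, y, time, wind) {
  // Project the point onto the wind axis. Points with the same projection
  // share a phase, so they rise and fall together.
  const along = x * Math.cos(wind.direction) + y * Math.sin(wind.direction);
  const k = (2 * Math.PI) / wind.wavelength; // spatial frequency
  // The crest advances along the wind axis at `speed` units per second.
  // `phase` is a per-line offset, which is one way jitter could enter.
  return wind.amplitude * Math.sin(k * (along - wind.speed * time) + (wind.phase || 0));
}

// Move a point by `offset` perpendicular to the line direction.
function displacePoint(x, y, offset, lineOrientation) {
  const nx = -Math.sin(lineOrientation); // unit normal to the line
  const ny = Math.cos(lineOrientation);
  return { x: x + offset * nx, y: y + offset * ny };
}

const wind = { speed: 40, direction: Math.PI / 2, wavelength: 200, amplitude: 22 };
const offset = windOffset(120, 80, 1.5, wind);
const moved = displacePoint(120, 80, offset, Math.PI / 2);
console.log(moved);
```

<p>In this sketch a larger <code>wavelength</code> gives broader, gentler curves, and a larger <code>amplitude</code> moves points further from their origin, which roughly mirrors the Wind Size and Jitter Diameter controls.</p>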

<div class="demo-box">
  <div class="demo-canvas-wrapper">
    <canvas id="demo2"></canvas>
    <div class="demo-paused-overlay paused" id="demo2-overlay">
      <button class="play-btn" aria-label="Play">
        <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
      </button>
      <span class="paused-message">Animation paused on mobile. Tap play to start.</span>
    </div>
    <button class="demo-pause-btn" id="demo2-pause" aria-label="Pause">
      <span class="fa-solid fa-circle-pause" aria-hidden="true"></span>
    </button>
  </div>
  <div class="demo-controls">
    <div class="demo-controls-featured">
      <div class="demo-controls-row">
        <div class="control-group">
          <label class="control-label">Wind Direction</label>
          <div class="compass-small" id="demo2-wind-compass">
            <span class="compass-direction compass-n">N</span>
            <span class="compass-direction compass-s">S</span>
            <span class="compass-direction compass-e">E</span>
            <span class="compass-direction compass-w">W</span>
            <div class="compass-needle" id="demo2-wind-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
            <div class="compass-center"></div>
          </div>
        </div>
        <div class="control-group">
          <label class="control-label">Wind Speed</label>
          <div class="slider-container">
            <input type="range" id="demo2-speed" min="0" max="10" value="4" step="0.5" />
          </div>
        </div>
        <div class="control-group">
          <label class="control-label">Wind Size</label>
          <div class="slider-container">
            <input type="range" id="demo2-size" min="0" max="100" value="99" />
          </div>
        </div>
        <div class="control-group">
          <label class="control-label">Jitter Amount</label>
          <div class="slider-container">
            <input type="range" id="demo2-jitter" min="0" max="100" value="1" />
          </div>
        </div>
        <div class="control-group">
          <label class="control-label">Jitter Diameter</label>
          <div class="slider-container">
            <input type="range" id="demo2-diameter" min="5" max="40" value="22" />
          </div>
        </div>
      </div>
    </div>
    <div class="control-group">
      <label class="control-label">Line Direction</label>
      <div class="compass-small" id="demo2-orient-compass">
        <span class="compass-direction compass-n">N</span>
        <span class="compass-direction compass-s">S</span>
        <span class="compass-direction compass-e">E</span>
        <span class="compass-direction compass-w">W</span>
        <div class="compass-needle" id="demo2-orient-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
        <div class="compass-center"></div>
      </div>
    </div>
    <div class="control-group">
      <label class="control-label">Line Spacing</label>
      <div class="slider-container">
        <input type="range" id="demo2-spacing" min="4" max="20" value="7" />
      </div>
    </div>
    <div class="control-group">
      <label class="control-label">Line Width</label>
      <div class="slider-container">
        <input type="range" id="demo2-width" min="0.3" max="2" value="0.7" step="0.1" />
      </div>
    </div>
    <div class="control-group">
      <label class="control-label">Opacity</label>
      <div class="slider-container">
        <input type="range" id="demo2-opacity" min="0" max="1" value="0.9" step="0.1" />
      </div>
    </div>
    <div class="control-group">
      <label class="control-label">Line Color</label>
      <input type="color" id="demo2-color" value="#b4afa5" />
    </div>
  </div>
</div>

<script type="module">
import { LineFieldAnimation } from '/assets/js/line-field.js';

const canvas = document.getElementById('demo2');
const animation = new LineFieldAnimation(canvas, {
  mode: 'container',
  params: {
    spacing: 7,
    width: 0.7,
    wind: 4.0,
    windDirection: Math.PI / 2,
    wind2: 0,
    jitter: 1,
    jitterDiameter: 22,
    windDensity: 7,
    windSize: 99,
    opacityDepthFactor: 0,
    lineOrientation: Math.PI / 2,
    color: '#b4afa5'
  }
});
setupAnimatedDemo(animation, 'demo2', 'demo2-overlay', 'demo2-pause');

// Wind compass
const windCompass = document.getElementById('demo2-wind-compass');
const windNeedle = document.getElementById('demo2-wind-needle');
let draggingWind3 = false;

function updateWind3(e) {
  const rect = windCompass.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ windDirection: angle });
  windNeedle.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}

windCompass.addEventListener('mousedown', (e) => { draggingWind3 = true; updateWind3(e); });
document.addEventListener('mousemove', (e) => { if (draggingWind3) updateWind3(e); });
document.addEventListener('mouseup', () => { draggingWind3 = false; });
windCompass.addEventListener('touchstart', (e) => { draggingWind3 = true; updateWind3(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingWind3) updateWind3(e.touches[0]); });
document.addEventListener('touchend', () => { draggingWind3 = false; });

// Orient compass
const orientCompass = document.getElementById('demo2-orient-compass');
const orientNeedle = document.getElementById('demo2-orient-needle');
let draggingOrient3 = false;

function updateOrient3(e) {
  const rect = orientCompass.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ lineOrientation: angle });
  orientNeedle.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}

orientCompass.addEventListener('mousedown', (e) => { draggingOrient3 = true; updateOrient3(e); });
document.addEventListener('mousemove', (e) => { if (draggingOrient3) updateOrient3(e); });
document.addEventListener('mouseup', () => { draggingOrient3 = false; });
orientCompass.addEventListener('touchstart', (e) => { draggingOrient3 = true; updateOrient3(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingOrient3) updateOrient3(e.touches[0]); });
document.addEventListener('touchend', () => { draggingOrient3 = false; });

document.getElementById('demo2-speed').addEventListener('input', (e) => {
  animation.setParams({ wind: parseFloat(e.target.value) });
});
document.getElementById('demo2-size').addEventListener('input', (e) => {
  animation.setParams({ windSize: parseInt(e.target.value) });
});
document.getElementById('demo2-spacing').addEventListener('input', (e) => {
  animation.setParams({ spacing: parseInt(e.target.value) });
});
document.getElementById('demo2-width').addEventListener('input', (e) => {
  animation.setParams({ width: parseFloat(e.target.value) });
});
document.getElementById('demo2-opacity').addEventListener('input', (e) => {
  animation.setParams({ opacity: parseFloat(e.target.value) });
});
document.getElementById('demo2-color').addEventListener('input', (e) => {
  animation.setParams({ color: e.target.value });
});
document.getElementById('demo2-jitter').addEventListener('input', (e) => {
  animation.setParams({ jitter: parseInt(e.target.value) });
});
document.getElementById('demo2-diameter').addEventListener('input', (e) => {
  animation.setParams({ jitterDiameter: parseInt(e.target.value) });
});
</script>

<h3 id="stage-3-dual-wind-sources">Stage 3: Dual Wind Sources</h3>

<div class="prompt">
"Let's add an additional wind source, with an independent wind direction and wind speed. Let's model displacement as wave interference between the two wave sources. I'd also like to add a wind density parameter that controls the spacing between wind points of origin."
</div>

<p>Adding a second wind with independent direction and speed creates <strong>interference patterns</strong>. The interaction between the two winds produces complex, naturalistic movement that neither wind creates alone.</p>

<p><strong>New controls:</strong></p>
<ul>
  <li><strong>Wind 2 Speed/Direction</strong> — A second independent wind source that combines with the first</li>
  <li><strong>Wind Density</strong> — How many wave cycles fit in the viewport (higher = more turbulent appearance)</li>
</ul>
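
<p>Interference between two waves amounts to adding their values at each point. The sketch below shows one way that could look. It is an assumption-laden illustration rather than the actual implementation: <code>planeWave</code>, <code>combinedWind</code>, the <code>cycles</code> argument (standing in for wind density), and <code>viewportSize</code> are hypothetical names.</p>

```javascript
// Illustrative sketch only: names and signatures are assumptions,
// not the actual internals of line-field.js.

// Value in [-1, 1] of a plane wave traveling along `direction` (radians).
// `cycles` is how many full wave cycles fit across `viewportSize`.
function planeWave(x, y, time, speed, direction, cycles, viewportSize) {
  const along = x * Math.cos(direction) + y * Math.sin(direction);
  const k = (2 * Math.PI * cycles) / viewportSize;
  return Math.sin(k * (along - speed * time));
}

// Superpose two winds. Where crests meet they reinforce each other,
// and where a crest meets a trough they cancel.
function combinedWind(x, y, time, wind1, wind2, cycles, viewportSize) {
  const a = planeWave(x, y, time, wind1.speed, wind1.direction, cycles, viewportSize);
  const b = planeWave(x, y, time, wind2.speed, wind2.direction, cycles, viewportSize);
  return (a + b) / 2; // halve the sum to stay within [-1, 1]
}

const wind1 = { speed: 40, direction: Math.PI / 2 };
const wind2 = { speed: 30, direction: (-Math.PI * 3) / 4 };
console.log(combinedWind(120, 80, 1.5, wind1, wind2, 7, 800));
```

<p>Because the two waves travel in different directions at different speeds, the places where they reinforce or cancel drift over time, which is what gives the combined motion its less regular look.</p>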

<div class="demo-box">
  <div class="demo-canvas-wrapper">
    <canvas id="demo3"></canvas>
    <div class="demo-paused-overlay paused" id="demo3-overlay">
      <button class="play-btn" aria-label="Play">
        <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
      </button>
      <span class="paused-message">Animation paused on mobile. Tap play to start.</span>
    </div>
    <button class="demo-pause-btn" id="demo3-pause" aria-label="Pause">
      <span class="fa-solid fa-circle-pause" aria-hidden="true"></span>
    </button>
  </div>
  <div class="demo-controls">
    <div class="demo-controls-featured">
      <div class="demo-controls-row">
        <div class="control-group">
          <label class="control-label">Wind Density</label>
          <div class="slider-container">
            <input type="range" id="demo3-density" min="0" max="100" value="7" />
          </div>
        </div>
        <div class="control-group">
          <label class="control-label">Wind Size</label>
          <div class="slider-container">
            <input type="range" id="demo3-size" min="0" max="100" value="99" />
          </div>
        </div>
      </div>
      <div class="demo-controls-row">
        <div class="control-group">
          <label class="control-label">Wind 1 Direction</label>
          <div class="compass-small" id="demo3-wind1-compass">
            <span class="compass-direction compass-n">N</span>
            <span class="compass-direction compass-s">S</span>
            <span class="compass-direction compass-e">E</span>
            <span class="compass-direction compass-w">W</span>
            <div class="compass-needle" id="demo3-wind1-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
            <div class="compass-center"></div>
          </div>
        </div>
        <div class="control-group">
          <label class="control-label">Wind 1 Speed</label>
          <div class="slider-container">
            <input type="range" id="demo3-wind1" min="0" max="10" value="4" step="0.5" />
          </div>
        </div>
        <div class="control-group">
          <label class="control-label">Wind 2 Direction</label>
          <div class="compass-small" id="demo3-wind2-compass">
            <span class="compass-direction compass-n">N</span>
            <span class="compass-direction compass-s">S</span>
            <span class="compass-direction compass-e">E</span>
            <span class="compass-direction compass-w">W</span>
            <div class="compass-needle" id="demo3-wind2-needle" style="transform: translateX(-50%) translateY(-100%) rotate(-45deg);"></div>
            <div class="compass-center"></div>
          </div>
        </div>
        <div class="control-group">
          <label class="control-label">Wind 2 Speed</label>
          <div class="slider-container">
            <input type="range" id="demo3-wind2" min="0" max="10" value="3" step="0.5" />
          </div>
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Jitter Amount</label>
        <div class="slider-container">
          <input type="range" id="demo3-jitter" min="0" max="100" value="1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Jitter Diameter</label>
        <div class="slider-container">
          <input type="range" id="demo3-diameter" min="5" max="40" value="22" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Line Direction</label>
        <div class="compass-small" id="demo3-orient-compass">
          <span class="compass-direction compass-n">N</span>
          <span class="compass-direction compass-s">S</span>
          <span class="compass-direction compass-e">E</span>
          <span class="compass-direction compass-w">W</span>
          <div class="compass-needle" id="demo3-orient-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
          <div class="compass-center"></div>
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Line Spacing</label>
        <div class="slider-container">
          <input type="range" id="demo3-spacing" min="4" max="20" value="7" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Line Width</label>
        <div class="slider-container">
          <input type="range" id="demo3-width" min="0.3" max="2" value="0.7" step="0.1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Opacity</label>
        <div class="slider-container">
          <input type="range" id="demo3-opacity" min="0" max="1" value="0.9" step="0.1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Line Color</label>
        <input type="color" id="demo3-color" value="#b4afa5" />
      </div>
    </div>
  </div>
</div>

<script type="module">
import { LineFieldAnimation } from '/assets/js/line-field.js';

const canvas = document.getElementById('demo3');
const animation = new LineFieldAnimation(canvas, {
  mode: 'container',
  params: {
    spacing: 7,
    width: 0.7,
    wind: 4.0,
    windDirection: Math.PI / 2,
    wind2: 3.0,
    wind2Direction: -Math.PI * 3 / 4,
    jitter: 1,
    jitterDiameter: 22,
    windDensity: 7,
    windSize: 99,
    opacityDepthFactor: 0,
    lineOrientation: Math.PI / 2,
    color: '#b4afa5'
  }
});
setupAnimatedDemo(animation, 'demo3', 'demo3-overlay', 'demo3-pause');

// Wind 2 compass (featured)
const wind2Compass = document.getElementById('demo3-wind2-compass');
const wind2Needle = document.getElementById('demo3-wind2-needle');
let draggingWind2_4 = false;

function updateWind2_4(e) {
  const rect = wind2Compass.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ wind2Direction: angle });
  wind2Needle.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}

wind2Compass.addEventListener('mousedown', (e) => { draggingWind2_4 = true; updateWind2_4(e); });
document.addEventListener('mousemove', (e) => { if (draggingWind2_4) updateWind2_4(e); });
document.addEventListener('mouseup', () => { draggingWind2_4 = false; });
wind2Compass.addEventListener('touchstart', (e) => { draggingWind2_4 = true; updateWind2_4(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingWind2_4) updateWind2_4(e.touches[0]); });
document.addEventListener('touchend', () => { draggingWind2_4 = false; });

// Wind 1 compass
const wind1Compass = document.getElementById('demo3-wind1-compass');
const wind1Needle = document.getElementById('demo3-wind1-needle');
let draggingWind1_4 = false;

function updateWind1_4(e) {
  const rect = wind1Compass.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ windDirection: angle });
  wind1Needle.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}

wind1Compass.addEventListener('mousedown', (e) => { draggingWind1_4 = true; updateWind1_4(e); });
document.addEventListener('mousemove', (e) => { if (draggingWind1_4) updateWind1_4(e); });
document.addEventListener('mouseup', () => { draggingWind1_4 = false; });
wind1Compass.addEventListener('touchstart', (e) => { draggingWind1_4 = true; updateWind1_4(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingWind1_4) updateWind1_4(e.touches[0]); });
document.addEventListener('touchend', () => { draggingWind1_4 = false; });

// Orient compass
const orientCompass = document.getElementById('demo3-orient-compass');
const orientNeedle = document.getElementById('demo3-orient-needle');
let draggingOrient4 = false;

function updateOrient4(e) {
  const rect = orientCompass.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ lineOrientation: angle });
  orientNeedle.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}

orientCompass.addEventListener('mousedown', (e) => { draggingOrient4 = true; updateOrient4(e); });
document.addEventListener('mousemove', (e) => { if (draggingOrient4) updateOrient4(e); });
document.addEventListener('mouseup', () => { draggingOrient4 = false; });
orientCompass.addEventListener('touchstart', (e) => { draggingOrient4 = true; updateOrient4(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingOrient4) updateOrient4(e.touches[0]); });
document.addEventListener('touchend', () => { draggingOrient4 = false; });

document.getElementById('demo3-wind2').addEventListener('input', (e) => {
  animation.setParams({ wind2: parseFloat(e.target.value) });
});
document.getElementById('demo3-wind1').addEventListener('input', (e) => {
  animation.setParams({ wind: parseFloat(e.target.value) });
});
document.getElementById('demo3-density').addEventListener('input', (e) => {
  animation.setParams({ windDensity: parseInt(e.target.value) });
});
document.getElementById('demo3-size').addEventListener('input', (e) => {
  animation.setParams({ windSize: parseInt(e.target.value) });
});
document.getElementById('demo3-spacing').addEventListener('input', (e) => {
  animation.setParams({ spacing: parseInt(e.target.value) });
});
document.getElementById('demo3-width').addEventListener('input', (e) => {
  animation.setParams({ width: parseFloat(e.target.value) });
});
document.getElementById('demo3-jitter').addEventListener('input', (e) => {
  animation.setParams({ jitter: parseInt(e.target.value) });
});
document.getElementById('demo3-diameter').addEventListener('input', (e) => {
  animation.setParams({ jitterDiameter: parseInt(e.target.value) });
});
document.getElementById('demo3-opacity').addEventListener('input', (e) => {
  animation.setParams({ opacity: parseFloat(e.target.value) });
});
document.getElementById('demo3-color').addEventListener('input', (e) => {
  animation.setParams({ color: e.target.value });
});
</script>

<h3 id="stage-4-depth-effect">Stage 4: Depth Effect</h3>

<div class="prompt">
"I'd like to add depth. Let's imagine the field of lines as a horizontal plane. Lines displaced above the plane appear closer to the viewer. Lines displaced below the plane appear farther away. I'd like a depth effect parameter that controls the opacity. When set to 0, all lines appear with constant opacity regardless of displacement. When set to 10, lines with greatest negative displacement (i.e. lines that appear to be further away) have 0 opacity and are fully translucent."
</div>

<p>The noise displacement value <strong>modulates opacity</strong>: points that displace more appear “closer” and more opaque. This simple trick adds surprising depth to a 2D animation.</p>

<p><strong>New control:</strong></p>
<ul>
  <li><strong>Depth Effect</strong> — How strongly displacement affects opacity (0 = uniform opacity, 1 = maximum depth variation)</li>
</ul>
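
<p>One way to map displacement to opacity is sketched below. It is a minimal example under assumptions, not the actual implementation: <code>depthOpacity</code> and <code>maxDisplacement</code> are hypothetical names, and the linear mapping is only one of several reasonable choices.</p>

```javascript
// Illustrative sketch only: the function name and the linear mapping are
// assumptions, not the actual internals of line-field.js.
function depthOpacity(baseOpacity, displacement, maxDisplacement, depthFactor) {
  // Normalize displacement to [-1, 1], then map it to a "nearness" in [0, 1]:
  // 0 = farthest below the plane, 1 = closest to the viewer.
  const normalized = Math.max(-1, Math.min(1, displacement / maxDisplacement));
  const nearness = (normalized + 1) / 2;
  // depthFactor 0 keeps the base opacity everywhere.
  // depthFactor 1 fades the farthest points to fully transparent.
  return baseOpacity * (1 - depthFactor * (1 - nearness));
}

console.log(depthOpacity(0.9, -22, 22, 0.6)); // farthest point, partially faded
console.log(depthOpacity(0.9, 22, 22, 0.6));  // closest point, full base opacity
```

<p>With this mapping the closest points always keep the base opacity, and the depth factor only controls how much the farther points fade.</p>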

<div class="demo-box">
  <div class="demo-canvas-wrapper">
    <canvas id="demo4"></canvas>
    <div class="demo-paused-overlay paused" id="demo4-overlay">
      <button class="play-btn" aria-label="Play">
        <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
      </button>
      <span class="paused-message">Animation paused on mobile. Tap play to start.</span>
    </div>
    <button class="demo-pause-btn" id="demo4-pause" aria-label="Pause">
      <span class="fa-solid fa-circle-pause" aria-hidden="true"></span>
    </button>
  </div>
  <div class="demo-controls">
    <div class="demo-controls-featured">
      <div class="control-group">
        <label class="control-label">Depth Effect</label>
        <div class="slider-container">
          <input type="range" id="demo4-depth" min="0" max="1" value="0.6" step="0.1" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Wind Density</label>
        <div class="slider-container">
          <input type="range" id="demo4-density" min="0" max="100" value="7" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind Size</label>
        <div class="slider-container">
          <input type="range" id="demo4-size" min="0" max="100" value="99" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Wind 1 Direction</label>
        <div class="compass-small" id="demo4-wind1-compass">
          <span class="compass-direction compass-n">N</span>
          <span class="compass-direction compass-s">S</span>
          <span class="compass-direction compass-e">E</span>
          <span class="compass-direction compass-w">W</span>
          <div class="compass-needle" id="demo4-wind1-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
          <div class="compass-center"></div>
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind 1 Speed</label>
        <div class="slider-container">
          <input type="range" id="demo4-wind1" min="0" max="10" value="4" step="0.5" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind 2 Direction</label>
        <div class="compass-small" id="demo4-wind2-compass">
          <span class="compass-direction compass-n">N</span>
          <span class="compass-direction compass-s">S</span>
          <span class="compass-direction compass-e">E</span>
          <span class="compass-direction compass-w">W</span>
          <div class="compass-needle" id="demo4-wind2-needle" style="transform: translateX(-50%) translateY(-100%) rotate(-45deg);"></div>
          <div class="compass-center"></div>
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind 2 Speed</label>
        <div class="slider-container">
          <input type="range" id="demo4-wind2" min="0" max="10" value="3" step="0.5" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Jitter Amount</label>
        <div class="slider-container">
          <input type="range" id="demo4-jitter" min="0" max="100" value="1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Jitter Diameter</label>
        <div class="slider-container">
          <input type="range" id="demo4-diameter" min="5" max="40" value="22" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Line Direction</label>
        <div class="compass-small" id="demo4-orient-compass">
          <span class="compass-direction compass-n">N</span>
          <span class="compass-direction compass-s">S</span>
          <span class="compass-direction compass-e">E</span>
          <span class="compass-direction compass-w">W</span>
          <div class="compass-needle" id="demo4-orient-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
          <div class="compass-center"></div>
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Line Spacing</label>
        <div class="slider-container">
          <input type="range" id="demo4-spacing" min="4" max="20" value="7" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Opacity</label>
        <div class="slider-container">
          <input type="range" id="demo4-opacity" min="0" max="1" value="0.9" step="0.1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Line Color</label>
        <input type="color" id="demo4-color" value="#b4afa5" />
      </div>
    </div>
  </div>
</div>

<script type="module">
import { LineFieldAnimation } from '/assets/js/line-field.js';

const canvas = document.getElementById('demo4');
const animation = new LineFieldAnimation(canvas, {
  mode: 'container',
  params: {
    spacing: 7,
    width: 0.7,
    wind: 4.0,
    windDirection: Math.PI / 2,
    wind2: 3.0,
    wind2Direction: -Math.PI * 3 / 4,
    jitter: 1,
    jitterDiameter: 22,
    windDensity: 7,
    windSize: 99,
    opacityDepthFactor: 0.6,
    lineOrientation: Math.PI / 2,
    color: '#b4afa5'
  }
});
setupAnimatedDemo(animation, 'demo4', 'demo4-overlay', 'demo4-pause');

// Wind 1 compass
const wind1Compass5 = document.getElementById('demo4-wind1-compass');
const wind1Needle5 = document.getElementById('demo4-wind1-needle');
let draggingW1_5 = false;
function updateW1_5(e) {
  const rect = wind1Compass5.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ windDirection: angle });
  wind1Needle5.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}
wind1Compass5.addEventListener('mousedown', (e) => { draggingW1_5 = true; updateW1_5(e); });
document.addEventListener('mousemove', (e) => { if (draggingW1_5) updateW1_5(e); });
document.addEventListener('mouseup', () => { draggingW1_5 = false; });
wind1Compass5.addEventListener('touchstart', (e) => { draggingW1_5 = true; updateW1_5(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingW1_5) updateW1_5(e.touches[0]); });
document.addEventListener('touchend', () => { draggingW1_5 = false; });

// Wind 2 compass
const wind2Compass5 = document.getElementById('demo4-wind2-compass');
const wind2Needle5 = document.getElementById('demo4-wind2-needle');
let draggingW2_5 = false;
function updateW2_5(e) {
  const rect = wind2Compass5.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ wind2Direction: angle });
  wind2Needle5.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}
wind2Compass5.addEventListener('mousedown', (e) => { draggingW2_5 = true; updateW2_5(e); });
document.addEventListener('mousemove', (e) => { if (draggingW2_5) updateW2_5(e); });
document.addEventListener('mouseup', () => { draggingW2_5 = false; });
wind2Compass5.addEventListener('touchstart', (e) => { draggingW2_5 = true; updateW2_5(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingW2_5) updateW2_5(e.touches[0]); });
document.addEventListener('touchend', () => { draggingW2_5 = false; });

// Orient compass
const orientCompass5 = document.getElementById('demo4-orient-compass');
const orientNeedle5 = document.getElementById('demo4-orient-needle');
let draggingO5 = false;
function updateO5(e) {
  const rect = orientCompass5.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ lineOrientation: angle });
  orientNeedle5.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}
orientCompass5.addEventListener('mousedown', (e) => { draggingO5 = true; updateO5(e); });
document.addEventListener('mousemove', (e) => { if (draggingO5) updateO5(e); });
document.addEventListener('mouseup', () => { draggingO5 = false; });
orientCompass5.addEventListener('touchstart', (e) => { draggingO5 = true; updateO5(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingO5) updateO5(e.touches[0]); });
document.addEventListener('touchend', () => { draggingO5 = false; });

document.getElementById('demo4-depth').addEventListener('input', (e) => {
  animation.setParams({ opacityDepthFactor: parseFloat(e.target.value) });
});
document.getElementById('demo4-density').addEventListener('input', (e) => {
  animation.setParams({ windDensity: parseInt(e.target.value) });
});
document.getElementById('demo4-wind1').addEventListener('input', (e) => {
  animation.setParams({ wind: parseFloat(e.target.value) });
});
document.getElementById('demo4-wind2').addEventListener('input', (e) => {
  animation.setParams({ wind2: parseFloat(e.target.value) });
});
document.getElementById('demo4-size').addEventListener('input', (e) => {
  animation.setParams({ windSize: parseInt(e.target.value) });
});
document.getElementById('demo4-spacing').addEventListener('input', (e) => {
  animation.setParams({ spacing: parseInt(e.target.value) });
});
document.getElementById('demo4-jitter').addEventListener('input', (e) => {
  animation.setParams({ jitter: parseInt(e.target.value) });
});
document.getElementById('demo4-diameter').addEventListener('input', (e) => {
  animation.setParams({ jitterDiameter: parseInt(e.target.value) });
});
document.getElementById('demo4-opacity').addEventListener('input', (e) => {
  animation.setParams({ opacity: parseFloat(e.target.value) });
});
document.getElementById('demo4-color').addEventListener('input', (e) => {
  animation.setParams({ color: e.target.value });
});
</script>

<h3 id="stage-5-whisp-effect">Stage 5: Whisp Effect</h3>

<div class="prompt">
"Can we make individual wind elements feel more whispy?"
</div>

<p>I left this prompt intentionally vague to see how the model would respond. First it attempted to add per-point turbulence, but this was too noisy. Eventually, we settled on this concept of “whisp”, which controls the intensity of wave amplitude on the displacement of clusters of lines.</p>

<p>Whisp pairs well with jitter, but the two controls are distinct. Jitter acts on each line independently, altering the line’s offset relative to a wave. Whisp is a noise source that drifts slowly across the field and acts on clusters of lines. As whisp noise passes over a cluster, it multiplies the amplitude of the waves (i.e. the wind) acting on that cluster. Because whisp moves independently of the two wind sources, it produces temporary, localized bursts of stronger wind.</p>

<p><strong>New control:</strong></p>
<ul>
  <li><strong>Whisp</strong> — Per-line variation in wind response (0 = all lines move uniformly, higher = some lines and their neighbors catch more wind than others)</li>
</ul>
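
<p>To make the mechanism concrete, here is a minimal sketch of whisp as an amplitude multiplier. It is not the code in <code>line-field.js</code>: <code>slowNoise</code>, <code>whispGain</code>, and <code>displacement</code> are hypothetical names, the frequency constants are arbitrary, and a sum of sines stands in for whatever smooth noise function the real animation uses.</p>

```javascript
// Hypothetical stand-in for a smooth noise source; returns a value in [0, 1].
// A sum of two sines keeps the sketch dependency-free (it is not simplex noise).
function slowNoise(x, t) {
  const n = Math.sin(x * 0.05 + t * 0.3) + Math.sin(x * 0.013 - t * 0.11);
  return (n + 2) / 4;
}

// Whisp as an amplitude multiplier. At whisp = 0 every line has gain 1
// (uniform motion); higher values let clusters under a noise peak catch
// more wind than their surroundings.
function whispGain(lineX, t, whisp) {
  const strength = whisp / 100; // slider range is 0..100
  return 1 + strength * slowNoise(lineX, t);
}

// Displacement of one line from a single travelling wave, scaled by whisp.
function displacement(lineX, t, windAmplitude, whisp) {
  const wave = Math.sin(lineX * 0.2 - t * 2);
  return windAmplitude * wave * whispGain(lineX, t, whisp);
}
```

<p>Because the noise varies slowly across <code>lineX</code>, neighboring lines receive nearly the same gain, which is what makes the effect read as clusters rather than per-line jitter.</p>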

<div class="demo-box">
  <div class="demo-canvas-wrapper">
    <canvas id="demo5"></canvas>
    <div class="demo-paused-overlay paused" id="demo5-overlay">
      <button class="play-btn" aria-label="Play">
        <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
      </button>
      <span class="paused-message">Animation paused on mobile. Tap play to start.</span>
    </div>
    <button class="demo-pause-btn" id="demo5-pause" aria-label="Pause">
      <span class="fa-solid fa-circle-pause" aria-hidden="true"></span>
    </button>
  </div>
  <div class="demo-controls">
    <div class="demo-controls-featured">
      <div class="control-group">
        <label class="control-label">Whisp</label>
        <div class="slider-container">
          <input type="range" id="demo5-whisp" min="0" max="100" value="50" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Depth Effect</label>
        <div class="slider-container">
          <input type="range" id="demo5-depth" min="0" max="1" value="0.6" step="0.1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind Density</label>
        <div class="slider-container">
          <input type="range" id="demo5-density" min="0" max="100" value="7" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind Size</label>
        <div class="slider-container">
          <input type="range" id="demo5-size" min="0" max="100" value="99" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Wind 1 Direction</label>
        <div class="compass-small" id="demo5-wind1-compass">
          <span class="compass-direction compass-n">N</span>
          <span class="compass-direction compass-s">S</span>
          <span class="compass-direction compass-e">E</span>
          <span class="compass-direction compass-w">W</span>
          <div class="compass-needle" id="demo5-wind1-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
          <div class="compass-center"></div>
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind 1 Speed</label>
        <div class="slider-container">
          <input type="range" id="demo5-wind1" min="0" max="10" value="4" step="0.5" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind 2 Direction</label>
        <div class="compass-small" id="demo5-wind2-compass">
          <span class="compass-direction compass-n">N</span>
          <span class="compass-direction compass-s">S</span>
          <span class="compass-direction compass-e">E</span>
          <span class="compass-direction compass-w">W</span>
          <div class="compass-needle" id="demo5-wind2-needle" style="transform: translateX(-50%) translateY(-100%) rotate(-45deg);"></div>
          <div class="compass-center"></div>
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind 2 Speed</label>
        <div class="slider-container">
          <input type="range" id="demo5-wind2" min="0" max="10" value="3" step="0.5" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Jitter Amount</label>
        <div class="slider-container">
          <input type="range" id="demo5-jitter" min="0" max="100" value="1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Jitter Diameter</label>
        <div class="slider-container">
          <input type="range" id="demo5-diameter" min="5" max="40" value="22" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Line Direction</label>
        <div class="compass-small" id="demo5-orient-compass">
          <span class="compass-direction compass-n">N</span>
          <span class="compass-direction compass-s">S</span>
          <span class="compass-direction compass-e">E</span>
          <span class="compass-direction compass-w">W</span>
          <div class="compass-needle" id="demo5-orient-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
          <div class="compass-center"></div>
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Line Spacing</label>
        <div class="slider-container">
          <input type="range" id="demo5-spacing" min="4" max="20" value="7" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Opacity</label>
        <div class="slider-container">
          <input type="range" id="demo5-opacity" min="0" max="1" value="0.9" step="0.1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Line Color</label>
        <input type="color" id="demo5-color" value="#b4afa5" />
      </div>
    </div>
  </div>
</div>

<script type="module">
import { LineFieldAnimation } from '/assets/js/line-field.js';

const canvas = document.getElementById('demo5');
const animation = new LineFieldAnimation(canvas, {
  mode: 'container',
  params: {
    spacing: 7,
    width: 0.7,
    wind: 4.0,
    windDirection: Math.PI / 2,
    wind2: 3.0,
    wind2Direction: -Math.PI * 3 / 4,
    jitter: 1,
    jitterDiameter: 22,
    windDensity: 7,
    windSize: 99,
    opacityDepthFactor: 0.6,
    whisp: 50,
    lineOrientation: Math.PI / 2,
    color: '#b4afa5'
  }
});
setupAnimatedDemo(animation, 'demo5', 'demo5-overlay', 'demo5-pause');

// Wind 1 compass
const wind1Compass6 = document.getElementById('demo5-wind1-compass');
const wind1Needle6 = document.getElementById('demo5-wind1-needle');
let draggingW1_6 = false;
function updateW1_6(e) {
  const rect = wind1Compass6.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ windDirection: angle });
  wind1Needle6.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}
wind1Compass6.addEventListener('mousedown', (e) => { draggingW1_6 = true; updateW1_6(e); });
document.addEventListener('mousemove', (e) => { if (draggingW1_6) updateW1_6(e); });
document.addEventListener('mouseup', () => { draggingW1_6 = false; });
wind1Compass6.addEventListener('touchstart', (e) => { draggingW1_6 = true; updateW1_6(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingW1_6) updateW1_6(e.touches[0]); });
document.addEventListener('touchend', () => { draggingW1_6 = false; });

// Wind 2 compass
const wind2Compass6 = document.getElementById('demo5-wind2-compass');
const wind2Needle6 = document.getElementById('demo5-wind2-needle');
let draggingW2_6 = false;
function updateW2_6(e) {
  const rect = wind2Compass6.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ wind2Direction: angle });
  wind2Needle6.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}
wind2Compass6.addEventListener('mousedown', (e) => { draggingW2_6 = true; updateW2_6(e); });
document.addEventListener('mousemove', (e) => { if (draggingW2_6) updateW2_6(e); });
document.addEventListener('mouseup', () => { draggingW2_6 = false; });
wind2Compass6.addEventListener('touchstart', (e) => { draggingW2_6 = true; updateW2_6(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingW2_6) updateW2_6(e.touches[0]); });
document.addEventListener('touchend', () => { draggingW2_6 = false; });

// Orient compass
const orientCompass6 = document.getElementById('demo5-orient-compass');
const orientNeedle6 = document.getElementById('demo5-orient-needle');
let draggingO6 = false;
function updateO6(e) {
  const rect = orientCompass6.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ lineOrientation: angle });
  orientNeedle6.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}
orientCompass6.addEventListener('mousedown', (e) => { draggingO6 = true; updateO6(e); });
document.addEventListener('mousemove', (e) => { if (draggingO6) updateO6(e); });
document.addEventListener('mouseup', () => { draggingO6 = false; });
orientCompass6.addEventListener('touchstart', (e) => { draggingO6 = true; updateO6(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingO6) updateO6(e.touches[0]); });
document.addEventListener('touchend', () => { draggingO6 = false; });

document.getElementById('demo5-whisp').addEventListener('input', (e) => {
  animation.setParams({ whisp: parseInt(e.target.value) });
});
document.getElementById('demo5-depth').addEventListener('input', (e) => {
  animation.setParams({ opacityDepthFactor: parseFloat(e.target.value) });
});
document.getElementById('demo5-wind1').addEventListener('input', (e) => {
  animation.setParams({ wind: parseFloat(e.target.value) });
});
document.getElementById('demo5-wind2').addEventListener('input', (e) => {
  animation.setParams({ wind2: parseFloat(e.target.value) });
});
document.getElementById('demo5-size').addEventListener('input', (e) => {
  animation.setParams({ windSize: parseInt(e.target.value) });
});
document.getElementById('demo5-density').addEventListener('input', (e) => {
  animation.setParams({ windDensity: parseInt(e.target.value) });
});
document.getElementById('demo5-spacing').addEventListener('input', (e) => {
  animation.setParams({ spacing: parseInt(e.target.value) });
});
document.getElementById('demo5-jitter').addEventListener('input', (e) => {
  animation.setParams({ jitter: parseInt(e.target.value) });
});
document.getElementById('demo5-diameter').addEventListener('input', (e) => {
  animation.setParams({ jitterDiameter: parseInt(e.target.value) });
});
document.getElementById('demo5-opacity').addEventListener('input', (e) => {
  animation.setParams({ opacity: parseFloat(e.target.value) });
});
document.getElementById('demo5-color').addEventListener('input', (e) => {
  animation.setParams({ color: e.target.value });
});
</script>

<h3 id="stage-6-gusts">Stage 6: Gusts</h3>

<div class="prompt">
"Let's add a gust parameter that controls the rate of wind noise generation for the two wind sources."
</div>

<p>Wind speed sets the maximum intensity; gust controls how much of that maximum is active in different regions.
Slow-moving spatial noise modulates each wind’s intensity, so at low gust values you see <strong>calm areas punctuated by gusts</strong> sweeping through.</p>

<p><strong>New controls:</strong></p>
<ul>
  <li><strong>Wind 1 Gust</strong> — Oscillation envelope for the first wind (10 = constant, 1 = mostly calm with occasional gusts)</li>
  <li><strong>Wind 2 Gust</strong> — Oscillation envelope for the second wind (independent of Wind 1)</li>
</ul>
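
<p>One way to picture the gust envelope is the sketch below. It is an illustration under stated assumptions, not the library’s implementation: <code>regionNoise</code>, <code>gustEnvelope</code>, and <code>effectiveWind</code> are hypothetical names, and raising the noise to the fourth power is just one arbitrary way to get “mostly calm with occasional gusts”.</p>

```javascript
// Hypothetical slow-moving spatial noise in [0, 1]; a stand-in only.
function regionNoise(x, y, t) {
  const n = Math.sin(x * 0.01 + t * 0.2) * Math.cos(y * 0.01 - t * 0.15);
  return (n + 1) / 2;
}

// Fraction of the maximum wind speed that is active for a gust setting 1..10.
// At 10 the envelope is a constant 1. At 1 it follows a sharpened noise curve
// that sits near 0 most of the time and only rises at noise peaks.
function gustEnvelope(noise, gust) {
  const calm = (10 - gust) / 9;     // 1 at gust 1, 0 at gust 10
  const peak = Math.pow(noise, 4);  // sharpen: mostly small, rare values near 1
  return 1 - calm * (1 - peak);
}

// Wind speed is the ceiling; the envelope decides how much of it applies here.
function effectiveWind(windSpeed, x, y, t, gust) {
  return windSpeed * gustEnvelope(regionNoise(x, y, t), gust);
}
```

<p>Evaluating a separate envelope per wind source (for example with different noise offsets) would keep Wind 1 and Wind 2 gusts independent of each other.</p>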

<div class="demo-box">
  <div class="demo-canvas-wrapper">
    <canvas id="demo6"></canvas>
    <div class="demo-paused-overlay paused" id="demo6-overlay">
      <button class="play-btn" aria-label="Play">
        <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
      </button>
      <span class="paused-message">Animation paused on mobile. Tap play to start.</span>
    </div>
    <button class="demo-pause-btn" id="demo6-pause" aria-label="Pause">
      <span class="fa-solid fa-circle-pause" aria-hidden="true"></span>
    </button>
  </div>
  <div class="demo-controls">
    <div class="demo-controls-featured">
      <div class="control-group">
        <label class="control-label">Wind 1 Gust</label>
        <div class="slider-container">
          <input type="range" id="demo6-gust1" min="1" max="10" value="3" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind 2 Gust</label>
        <div class="slider-container">
          <input type="range" id="demo6-gust2" min="1" max="10" value="3" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Whisp</label>
        <div class="slider-container">
          <input type="range" id="demo6-whisp" min="0" max="100" value="30" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Depth Effect</label>
        <div class="slider-container">
          <input type="range" id="demo6-depth" min="0" max="1" value="0.6" step="0.1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind Density</label>
        <div class="slider-container">
          <input type="range" id="demo6-density" min="0" max="100" value="7" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind Size</label>
        <div class="slider-container">
          <input type="range" id="demo6-size" min="0" max="100" value="99" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Wind 1 Direction</label>
        <div class="compass-small" id="demo6-wind1-compass">
          <span class="compass-direction compass-n">N</span>
          <span class="compass-direction compass-s">S</span>
          <span class="compass-direction compass-e">E</span>
          <span class="compass-direction compass-w">W</span>
          <div class="compass-needle" id="demo6-wind1-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
          <div class="compass-center"></div>
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind 1 Speed</label>
        <div class="slider-container">
          <input type="range" id="demo6-wind1" min="0" max="10" value="4" step="0.5" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind 2 Direction</label>
        <div class="compass-small" id="demo6-wind2-compass">
          <span class="compass-direction compass-n">N</span>
          <span class="compass-direction compass-s">S</span>
          <span class="compass-direction compass-e">E</span>
          <span class="compass-direction compass-w">W</span>
          <div class="compass-needle" id="demo6-wind2-needle" style="transform: translateX(-50%) translateY(-100%) rotate(-45deg);"></div>
          <div class="compass-center"></div>
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Wind 2 Speed</label>
        <div class="slider-container">
          <input type="range" id="demo6-wind2" min="0" max="10" value="3" step="0.5" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Jitter Amount</label>
        <div class="slider-container">
          <input type="range" id="demo6-jitter" min="0" max="100" value="1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Jitter Diameter</label>
        <div class="slider-container">
          <input type="range" id="demo6-diameter" min="5" max="40" value="22" />
        </div>
      </div>
    </div>
    <div class="demo-controls-row">
      <div class="control-group">
        <label class="control-label">Line Direction</label>
        <div class="compass-small" id="demo6-orient-compass">
          <span class="compass-direction compass-n">N</span>
          <span class="compass-direction compass-s">S</span>
          <span class="compass-direction compass-e">E</span>
          <span class="compass-direction compass-w">W</span>
          <div class="compass-needle" id="demo6-orient-needle" style="transform: translateX(-50%) translateY(-100%) rotate(180deg);"></div>
          <div class="compass-center"></div>
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Line Spacing</label>
        <div class="slider-container">
          <input type="range" id="demo6-spacing" min="4" max="20" value="7" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Opacity</label>
        <div class="slider-container">
          <input type="range" id="demo6-opacity" min="0" max="1" value="0.9" step="0.1" />
        </div>
      </div>
      <div class="control-group">
        <label class="control-label">Line Color</label>
        <input type="color" id="demo6-color" value="#b4afa5" />
      </div>
    </div>
  </div>
</div>

<script type="module">
import { LineFieldAnimation } from '/assets/js/line-field.js';

const canvas = document.getElementById('demo6');
const animation = new LineFieldAnimation(canvas, {
  mode: 'container',
  params: {
    spacing: 7,
    width: 0.7,
    wind: 4,
    windDirection: Math.PI / 2,
    windOscillation: 3,
    wind2: 3,
    wind2Direction: -Math.PI * 3 / 4,
    wind2Oscillation: 3,
    jitter: 1,
    jitterDiameter: 22,
    windDensity: 7,
    windSize: 99,
    opacityDepthFactor: 0.6,
    whisp: 30,
    lineOrientation: Math.PI / 2,
    color: '#b4afa5'
  }
});
setupAnimatedDemo(animation, 'demo6', 'demo6-overlay', 'demo6-pause');

// Wind 1 compass
const wind1Compass7 = document.getElementById('demo6-wind1-compass');
const wind1Needle7 = document.getElementById('demo6-wind1-needle');
let draggingW1_7 = false;
function updateW1_7(e) {
  const rect = wind1Compass7.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ windDirection: angle });
  wind1Needle7.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}
wind1Compass7.addEventListener('mousedown', (e) => { draggingW1_7 = true; updateW1_7(e); });
document.addEventListener('mousemove', (e) => { if (draggingW1_7) updateW1_7(e); });
document.addEventListener('mouseup', () => { draggingW1_7 = false; });
wind1Compass7.addEventListener('touchstart', (e) => { draggingW1_7 = true; updateW1_7(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingW1_7) updateW1_7(e.touches[0]); });
document.addEventListener('touchend', () => { draggingW1_7 = false; });

// Wind 2 compass
const wind2Compass7 = document.getElementById('demo6-wind2-compass');
const wind2Needle7 = document.getElementById('demo6-wind2-needle');
let draggingW2_7 = false;
function updateW2_7(e) {
  const rect = wind2Compass7.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ wind2Direction: angle });
  wind2Needle7.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}
wind2Compass7.addEventListener('mousedown', (e) => { draggingW2_7 = true; updateW2_7(e); });
document.addEventListener('mousemove', (e) => { if (draggingW2_7) updateW2_7(e); });
document.addEventListener('mouseup', () => { draggingW2_7 = false; });
wind2Compass7.addEventListener('touchstart', (e) => { draggingW2_7 = true; updateW2_7(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingW2_7) updateW2_7(e.touches[0]); });
document.addEventListener('touchend', () => { draggingW2_7 = false; });

// Orient compass
const orientCompass7 = document.getElementById('demo6-orient-compass');
const orientNeedle7 = document.getElementById('demo6-orient-needle');
let draggingO7 = false;
function updateO7(e) {
  const rect = orientCompass7.getBoundingClientRect();
  const angle = Math.atan2(e.clientY - rect.top - rect.height/2, e.clientX - rect.left - rect.width/2);
  animation.setParams({ lineOrientation: angle });
  orientNeedle7.style.transform = `translateX(-50%) translateY(-100%) rotate(${angle * 180/Math.PI + 90}deg)`;
}
orientCompass7.addEventListener('mousedown', (e) => { draggingO7 = true; updateO7(e); });
document.addEventListener('mousemove', (e) => { if (draggingO7) updateO7(e); });
document.addEventListener('mouseup', () => { draggingO7 = false; });
orientCompass7.addEventListener('touchstart', (e) => { draggingO7 = true; updateO7(e.touches[0]); e.preventDefault(); });
document.addEventListener('touchmove', (e) => { if (draggingO7) updateO7(e.touches[0]); });
document.addEventListener('touchend', () => { draggingO7 = false; });

document.getElementById('demo6-gust1').addEventListener('input', (e) => {
  animation.setParams({ windOscillation: parseInt(e.target.value) });
});
document.getElementById('demo6-gust2').addEventListener('input', (e) => {
  animation.setParams({ wind2Oscillation: parseInt(e.target.value) });
});
document.getElementById('demo6-whisp').addEventListener('input', (e) => {
  animation.setParams({ whisp: parseInt(e.target.value) });
});
document.getElementById('demo6-depth').addEventListener('input', (e) => {
  animation.setParams({ opacityDepthFactor: parseFloat(e.target.value) });
});
document.getElementById('demo6-wind1').addEventListener('input', (e) => {
  animation.setParams({ wind: parseFloat(e.target.value) });
});
document.getElementById('demo6-wind2').addEventListener('input', (e) => {
  animation.setParams({ wind2: parseFloat(e.target.value) });
});
document.getElementById('demo6-size').addEventListener('input', (e) => {
  animation.setParams({ windSize: parseInt(e.target.value) });
});
document.getElementById('demo6-density').addEventListener('input', (e) => {
  animation.setParams({ windDensity: parseInt(e.target.value) });
});
document.getElementById('demo6-spacing').addEventListener('input', (e) => {
  animation.setParams({ spacing: parseInt(e.target.value) });
});
document.getElementById('demo6-jitter').addEventListener('input', (e) => {
  animation.setParams({ jitter: parseInt(e.target.value) });
});
document.getElementById('demo6-diameter').addEventListener('input', (e) => {
  animation.setParams({ jitterDiameter: parseInt(e.target.value) });
});
document.getElementById('demo6-opacity').addEventListener('input', (e) => {
  animation.setParams({ opacity: parseFloat(e.target.value) });
});
document.getElementById('demo6-color').addEventListener('input', (e) => {
  animation.setParams({ color: e.target.value });
});
</script>

<hr />

<h2 id="prompting-for-creative-code">Prompting for Creative Code</h2>

<p>What surprised me most was how well conversational iteration works for this kind of project. Each prompt built on the last, and the AI maintained context about what we’d built.</p>

<p>A few patterns that worked well:</p>

<p><strong>Start vague, then refine</strong>: “Add a subtle animation” -&gt; “Make it wave” -&gt; “Add wind direction” -&gt; “Two wind sources”</p>

<p><strong>Incorporate feeling without controlling implementation</strong>: “Make it feel more whispy” would often lead to a better solution than “add turbulence to each point.”</p>

<p><strong>Iterate via feedback</strong>: When the first whisp implementation felt wrong, describing <em>why</em> (“too choppy, no gradual movement”) helped find a better approach.</p>

<p><strong>Use a sandbox</strong>: Adjusting parameters live and experimenting at the extreme ends of values is often more intuitive than reading the code alone.</p>

<hr />

<h2 id="the-sandbox-playing-with-parameters">The Sandbox: Playing with Parameters</h2>

<p>I find the general recipe of <strong>build a sandbox first, then refine through iteration</strong> works surprisingly well across a broad range of problems, not just for creative coding and play like this animation exercise. The underlying principle, which echoes themes from REPL-driven development, is that the faster your feedback loop, the more of the solution space you can explore and sample on the way to a good solution.</p>

<p>If you want to play with the sandbox I used to create this animation, visit the <a href="/animations">animations sandbox</a> and experiment. And if you build something interesting, I’d love to see it.</p>

<hr />

<h2 id="future-directions">Future Directions</h2>

<p>This line field animation was just a fun idea I had, but there are others I would like to explore:</p>

<h3 id="genetic-animations">Genetic Animations</h3>

<p>What if parameters evolved over time based on fitness functions? Lines that “survive” based on aesthetic criteria, gradually evolving toward interesting configurations.</p>
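
<p>A rough sketch of what one generation could look like. Everything here is hypothetical: <code>mutate</code>, <code>evolve</code>, the <code>ranges</code> map, and the <code>fitness</code> callback are illustrative names, and scoring “aesthetic criteria” is left entirely to the caller.</p>

```javascript
// Nudge each numeric parameter by a random amount, clamped to its range.
// `ranges` maps a parameter name to [min, max]; `random` is injectable
// so the behavior can be made deterministic.
function mutate(params, ranges, rate, random = Math.random) {
  const child = { ...params };
  for (const [key, [min, max]] of Object.entries(ranges)) {
    const nudge = (random() * 2 - 1) * rate * (max - min);
    child[key] = Math.min(max, Math.max(min, params[key] + nudge));
  }
  return child;
}

// One generation: rank by fitness, keep the fitter half, and refill the
// population with mutated copies of the survivors.
function evolve(population, ranges, fitness, rate = 0.1, random = Math.random) {
  const ranked = [...population].sort((a, b) => fitness(b) - fitness(a));
  const survivors = ranked.slice(0, Math.ceil(ranked.length / 2));
  const children = survivors.map((p) => mutate(p, ranges, rate, random));
  return [...survivors, ...children].slice(0, population.length);
}
```

<p>Each resulting parameter set could then be handed to <code>animation.setParams</code>, with the fitness score coming from whatever criterion you choose, such as a manual rating.</p>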

<h3 id="layered-animations">Layered Animations</h3>

<p>Multiple animation layers with different parameters, composited with blend modes. A fast, fine-grained layer over a slow, broad layer could create rich depth.</p>

<h3 id="interactive-response">Interactive Response</h3>

<p>Animations that respond to mouse position or cursor movement. Wind that flows away from the cursor, or lines that orient toward it.</p>
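
<p>As a starting point, wind that flows away from the cursor can reuse the same <code>Math.atan2</code> convention as the compass controls above. The helper name <code>windDirectionAwayFrom</code> is hypothetical, and this sketch simplifies by steering a single global wind direction from the cursor toward the canvas center rather than computing a direction per line.</p>

```javascript
// Angle (radians) pointing from the cursor toward the center of `rect`,
// i.e. the direction "away from the cursor" as seen from the canvas center.
// `rect` only needs left, top, width, and height, so the result of
// getBoundingClientRect() works as-is.
function windDirectionAwayFrom(rect, clientX, clientY) {
  const centerX = rect.left + rect.width / 2;
  const centerY = rect.top + rect.height / 2;
  return Math.atan2(centerY - clientY, centerX - clientX);
}

// Possible wiring, assuming the `canvas` and `animation` objects from the demos:
// canvas.addEventListener('mousemove', (e) => {
//   const rect = canvas.getBoundingClientRect();
//   animation.setParams({
//     windDirection: windDirectionAwayFrom(rect, e.clientX, e.clientY)
//   });
// });
```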

<h3 id="graph-based-visualizations">Graph-Based Visualizations</h3>

<p>Instead of parallel lines, what about connected graphs? Nodes that drift with noise, edges that stretch and compress, creating organic network visualizations.</p>

<h3 id="audio-reactive">Audio-Reactive</h3>

<p>Parameters modulated by audio input—bass driving wind speed, treble affecting jitter. The animation becomes a visualizer.</p>
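
<p>A sketch of the mapping step, assuming the frequency data arrives as byte values (0 to 255), which is the format the Web Audio API’s <code>AnalyserNode.getByteFrequencyData</code> fills into a <code>Uint8Array</code>. The function names and the choice of the lowest and highest quarter of bins as “bass” and “treble” are arbitrary assumptions.</p>

```javascript
// Average of the bins in [start, end), scaled from 0..255 down to 0..1.
function bandLevel(bins, start, end) {
  let sum = 0;
  for (const value of bins.slice(start, end)) sum += value;
  return sum / ((end - start) * 255);
}

// Map bass to wind speed (slider range 0..10) and treble to jitter (0..100).
function audioToParams(bins) {
  const bassEnd = Math.floor(bins.length / 4);
  const trebleStart = Math.floor((bins.length * 3) / 4);
  return {
    wind: 10 * bandLevel(bins, 0, bassEnd),
    jitter: Math.round(100 * bandLevel(bins, trebleStart, bins.length)),
  };
}
```

<p>In a browser, the result could be passed to <code>animation.setParams</code> once per frame after refreshing the bins from an analyser node.</p>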

<hr />

<h2 id="llms-open-new-dimensions">LLMs open new dimensions</h2>

<p>What started as a simple, amusing background animation experiment quickly became an exploration of noise functions, wave interference, and the expressive power of parameterized systems.</p>

<p>I would never have spent the time learning and playing with traveling planar waves, simplex noise, or linear interpolation if it weren’t for the ease with which LLMs enable this form of play. Just as each parameter in the animation opens a dimension of variation, LLMs open a dimension of infinite creativity and exploration.</p>
]]>
      </content:encoded>
    </item>
    
    <item>
      <title>10 Lessons from 10 years at GitHub</title>
      <link>https://rickwinfrey.com/writings/10-lessons-from-github</link>
      <guid isPermaLink="true">https://rickwinfrey.com/writings/10-lessons-from-github</guid>
      <pubDate>Fri, 04 Jul 2025 14:00:00 +0000</pubDate>
      
      <description>10 years. 10 lessons. The most important things I learned building developer tools at GitHub.</description>
      
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <![CDATA[<p>A decade at GitHub as a developer rewired how I think about software.</p>

<p>While I contributed to GitHub’s Rails monolith, most of my work focused on building code intelligence services external to it.</p>

<p>This included contributing to <a href="https://tree-sitter.github.io/tree-sitter/">Tree-sitter</a> grammars and runtime, helping build a research-oriented program analysis library in Haskell called <a href="https://github.com/github/semantic">Semantic</a>, and shipping GitHub’s first <a href="https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github">code navigation</a> service powering jump-to-definition and find-all-references features.</p>

<p>Later, I helped evolve that system into zero-config, precise code navigation using <a href="https://github.com/github/stack-graphs">Stack Graphs</a>.</p>

<p>With the rise of generative AI, I contributed to Copilot Chat for the web and authored the prompt-building library behind its dynamic context logic.</p>

<p>Most recently, I spent the last two years working on GitHub’s code search system, <a href="https://github.blog/engineering/architecture-optimization/the-technology-behind-githubs-new-code-search/">Blackbird</a>.</p>

<p>I’ve captured the 10 lessons that mattered most, one for each year, and I’m taking them with me.</p>

<hr />

<h3 id="1-the-core-is-the-moat">1. The core is the moat</h3>

<p>GitHub’s platform is indispensable, but only because its core experience is stable, fast, and reliable.</p>

<p>People often describe “moats” in terms of features, data, or network effects. But none of that matters if your foundation is broken. A product with 1,000 features has little value if 900 are buggy or slow.</p>

<p>The real moat? Relentlessly nailing the core experience. Every time.</p>

<hr />

<h3 id="2-build-first-for-customers">2. Build first for customers</h3>

<p>Before GitHub’s 2018 acquisition by Microsoft, the <a href="https://github.com/dear-github/dear-github">Dear GitHub</a> letter captured widespread frustration with the platform. Under Nat Friedman’s leadership, a “paper cuts” initiative helped rebuild trust with the open source community by fixing small but painful issues.</p>

<p>The lesson: Dogfooding is good, but over-relying on internal usage can lead to blind spots. You risk overfitting your product to narrow needs and overlooking real customer pain.</p>

<p>If you’re lucky, customers will tell you what’s broken. If you’re not, they’ll leave without saying a word.</p>

<hr />

<h3 id="3-make-it-work-make-it-scale-make-it-faster">3. Make it work. Make it scale. Make it faster.</h3>

<p>Kent Beck’s classic advice, “make it work, make it right, make it fast”, holds true. But at scale, I’ve found another ordering works better: “Make it work, <strong>make it scale, make it faster</strong>.”</p>

<p>When I joined the code search team, our backfill job, responsible for keeping our 140+ million repository index fresh, was taking five days to complete. This bottleneck made experimentation risky and recovery from failure slow.</p>

<p>After rounds of optimization, I cut the process from 5 days to 34 hours, a 72% improvement. It instantly increased trust in our system and unlocked team velocity.</p>

<p>Speed builds trust. Whether it’s internal tools or user-facing products, faster wins.</p>

<hr />

<h3 id="4-know-your-tools-especially-the-ones-you-build">4. Know your tools, especially the ones you build</h3>

<p>In dev tools, there’s no substitute for real usage. If you’re not using your own product, you’re building on assumptions.</p>

<p>Great engineers don’t just use tools. They study, tweak, and master them. They learn constantly, challenge defaults, and experiment with new workflows.</p>

<p>Follow your curiosity. Stay uncomfortable. Keep tinkering. Remember to play. Our craft improves with care.</p>

<hr />

<h3 id="5-good-telemetry-is-priceless-bad-telemetry-is-noise">5. Good telemetry is priceless. Bad telemetry is noise.</h3>

<p>If you can’t measure it, you can’t fix it. But if you measure everything, you can’t see anything.</p>

<p>Over-logging and dashboard bloat create fog, not clarity. The best observability isn’t about volume, it’s about relevance.</p>

<p>Keep dashboards lean. Prune aggressively. During every incident, ask:</p>

<ul>
  <li>What helped us resolve this faster?</li>
  <li>What slowed us down?</li>
</ul>

<p>Telemetry should evolve with your systems, or it’ll betray you when it matters most.</p>

<hr />

<h3 id="6-legacy-code-is-a-historic-renovation-project">6. Legacy code is a historic renovation project</h3>

<p>Legacy systems carry the business. Customers rely on them. Maintaining them is an honor, not a chore.</p>

<p>In software, unlike architecture, you can renovate without permits. But that takes care: knowing what to preserve and what to rework. This skill is hard to learn, but ensures systems can change and respond to business needs over time.</p>

<p>Leaders who overlook this work during performance reviews risk starving the systems that got them here. Refactoring legacy systems is hard, risky, and essential. It should be celebrated and rewarded.</p>

<hr />

<h3 id="7-software-is-a-team-sport">7. Software is a team sport</h3>

<p>No matter how skilled you are, your impact depends on how well you collaborate, within your team and across functions.</p>

<p>Some career-changing lessons I’ve learned:</p>

<ul>
  <li>Deliver consistently.</li>
  <li>Own what you commit.</li>
  <li>Reflect while building, not just after.</li>
  <li>Learn from teammates’ strengths and share your own.</li>
  <li>Communicate with care; it costs others time and energy.</li>
</ul>

<p>Growth isn’t just about skill. It’s about trust, connection, and shared momentum.</p>

<hr />

<h3 id="8-value-is-subjective">8. Value is subjective</h3>

<p>The best tech doesn’t always win. That’s because value is rarely objective.</p>

<p>Influence matters. And influence starts with making the case:</p>
<ul>
  <li>Show the business impact.</li>
  <li>Prove it with demos and results.</li>
  <li>Tell a compelling story. Repeat it.</li>
</ul>

<p>Even then, your best ideas might not win. That’s okay. Influence is a game with uneven footing: some seats are closer to the decision-maker. Play with integrity. Once the decision is made, commit and move forward.</p>

<hr />

<h3 id="9-read-the-research">9. Read the research</h3>

<p>With the explosion of generative AI, the gap between what’s published and what’s productized is wider than ever.</p>

<p>Amazing ideas are freely available on <a href="https://arxiv.org">arXiv</a>. You don’t have to read everything; just skim abstracts to spot patterns.</p>

<p>If you go deeper, ask:</p>
<ul>
  <li>What are the key insights behind this result?</li>
  <li>Can I extend this to other domains?</li>
  <li>How does it build on or challenge prior work?</li>
  <li>What new research does this enable?</li>
</ul>

<p>My two favorite tricks: use AI to summarize a paper’s bibliography and trace its intellectual lineage, and use AI to retrieve other research referencing the paper.</p>

<hr />

<h3 id="10-be-flexible-enough-to-adapt-be-focused-enough-to-matter">10. Be flexible enough to adapt. Be focused enough to matter.</h3>

<p>One of the best pieces of advice I got at GitHub: “Some degree of fungibility is good.”</p>

<p>Adaptability opens doors. Specialization gives you the leverage to walk through them.</p>

<p>If fungibility gets you in the room, depth helps you lead once you’re there.</p>

<hr />

<h3 id="final-note">Final note</h3>

<p>Working at GitHub on the world’s largest developer platform has been a privilege. I’m deeply grateful for the experience and opportunity to meaningfully improve developer tooling. As much as the problems were meaningful, the people made GitHub special. I’ll deeply miss working with all my GitHub friends.</p>

<p>During Nat Friedman’s departure from GitHub, he said to consider work at GitHub as “being of service to all developers.” To past, present, and future GitHub devs, thank you for serving all developers and pushing the platform forward. 🙇‍♂️</p>

<h3 id="whats-next">What’s next</h3>

<p>After 10 years specializing in code intelligence, I’m excited to join <a href="https://nuanced.dev">Nuanced</a> to help apply program analysis and static analysis techniques to improve code generation workflows for AI.</p>

<p>Effective, efficient code context is a significant, unsolved problem, especially for large, complex codebases. I’m excited to bring everything I’ve learned at GitHub to help tackle it.</p>
]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Animating Generative AI Text with Promises and the Y-Combinator</title>
      <link>https://rickwinfrey.com/writings/animating-generative-text</link>
      <guid isPermaLink="true">https://rickwinfrey.com/writings/animating-generative-text</guid>
      <pubDate>Sat, 15 Apr 2023 21:10:07 +0000</pubDate>
      
      <description>Recursive promises, a Y-combinator, and an AI-generated poem walk into a browser—this post is what happens next.</description>
      
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <![CDATA[<p>
  The core concept was simple: take some AI-generated text, iterate over its characters or words, and animate each by fading it from fully transparent to fully opaque in the DOM.
  To make things more interesting, I created an abstraction over JavaScript's <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise">Promise constructor</a>, using a series of combinators. The result is a sort of mini-promise engine for declarative chaining that uses the <a href="https://en.wikipedia.org/wiki/Fixed-point_combinator#What_is_a_.22combinator.22.3F">Y-combinator</a> for recursion.
  This has no practical purpose beyond satisfying my curiosity about JavaScript promises. (Note: this code is not intended for production use!)
</p>

<p>
  For this experiment, I created three animations described below. Some technical details about the implementation follow after the animations.
</p>

<hr />

<h3>Character-by-character animation</h3>

<div class="demo-columns">
  <div class="demo-description">
    <p>
      This animation uses a character-by-character iterator. The "Delay" slider controls the amount of delay (in milliseconds) between each iteration of the input string representing text produced by a generative AI system. In this case, the input string is a poem created by an LLM.
    </p>

    <p>
      On each iteration a new character is added to a queue. The "Opacity steps" slider controls the number of steps it takes a character's opacity to go from 0 (transparent) to 100 (opaque). Each character’s opacity increases on every animation tick until it reaches 100%, at which point the character is "set" in the DOM without any further updates and is dropped from the iterator's queue.
    </p>

    <p>
      Feel free to experiment with different values of "Delay" and "Opacity steps". More delay means a longer pause before the next iteration, and provides the illusion of a slower overall animation of the input string. More opacity steps shrink the opacity increment added to each character per iteration, which is equivalent to <code>current.opacity += 100 / steps</code>, and provide the illusion of a slower "fade-in" of each character.
    </p>
  </div>
  <div class="demo-block">
    <div class="controls-wrapper">
      <div id="controls-char">
        <div class="slider-row">
          <label class="slider-label" for="steps-char">Opacity steps</label>
          <input type="range" id="steps-char" min="1" max="100" value="20" />
          <span id="stepsValue-char" class="slider-value">20</span>
        </div>

        <div class="slider-row">
          <label class="slider-label" for="delay-char">Delay</label>
          <input type="range" id="delay-char" min="0" max="100" value="5" />
          <span id="delayValue-char" class="slider-value">5</span>
        </div>

        <div class="controls">
          <button id="play-char" class="control" aria-label="Play" title="Play">
            <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
          </button>
          <button id="pause-char" class="control" aria-label="Pause" title="Pause">
            <span class="fa-solid fa-circle-pause" aria-hidden="true"></span>
          </button>
          <button id="clear-char" class="control" aria-label="Clear" title="Clear">
            <span class="fa-solid fa-trash" aria-hidden="true"></span>
          </button>
        </div>
      </div>
    </div>

    <div class="demo-container-wrapper">
      <div class="demo-container">
        <p id="char"></p>
      </div>
      <div class="demo-paused-overlay paused" id="overlay-char">
        <button class="play-btn" aria-label="Play">
          <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
        </button>
        <span class="paused-message">Animation paused on mobile. Tap play to start.</span>
      </div>
    </div>
  </div>
</div>

<hr />

<h3>Word-by-word animation</h3>

<div class="demo-columns">
  <div class="demo-description">
    <p>
      This animation works like the character-based one but iterates over words instead of individual characters.
    </p>

    <p>
      In this animation, the "Delay" and "Opacity steps" sliders are more impactful compared with the character-by-character animation.
    </p>

    <p>
      Also feel free to experiment with the different buttons. The animation plays by default, but you can pause and resume it at any time. This behavior is a byproduct of the underlying implementation, which uses anonymous recursive promises to chain the iteration and animation updates. The "Clear" button stops the animation, resets its state, and clears the output stream.
    </p>
  </div>

  <div class="demo-block">
    <div class="controls-wrapper">
      <div id="controls-word">
        <div class="slider-row">
          <label class="slider-label" for="steps-word">Opacity steps</label>
          <input type="range" id="steps-word" min="1" max="100" value="20" />
          <span id="stepsValue-word" class="slider-value">20</span>
        </div>

        <div class="slider-row">
          <label class="slider-label" for="delay-word">Delay</label>
          <input type="range" id="delay-word" min="0" max="100" value="20" />
          <span id="delayValue-word" class="slider-value">20</span>
        </div>

        <div class="controls">
          <button id="play-word" class="control" aria-label="Play" title="Play">
            <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
          </button>
          <button id="pause-word" class="control" aria-label="Pause" title="Pause">
            <span class="fa-solid fa-circle-pause" aria-hidden="true"></span>
          </button>
          <button id="clear-word" class="control" aria-label="Clear" title="Clear">
            <span class="fa-solid fa-trash" aria-hidden="true"></span>
          </button>
        </div>
      </div>
    </div>

    <div class="demo-container-wrapper">
      <div class="demo-container">
        <p id="word"></p>
      </div>
      <div class="demo-paused-overlay paused" id="overlay-word">
        <button class="play-btn" aria-label="Play">
          <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
        </button>
        <span class="paused-message">Animation paused on mobile. Tap play to start.</span>
      </div>
    </div>
  </div>
</div>

<hr />

<h3>Word-by-word with probability animation</h3>

<div class="demo-columns">
  <div class="demo-description">
    <p>
      This animation indicates the probability of each generated word as it is added to the output stream.
      It also reveals the word's probability on hover.
    </p>
  </div>

  <div class="demo-block">
    <div class="controls-wrapper">
      <div id="controls-word-probability">
        <div class="slider-row">
          <label class="slider-label" for="steps-word-probability">Opacity steps</label>
          <input type="range" id="steps-word-probability" min="1" max="100" value="20" />
          <span id="stepsValue-word-probability" class="slider-value">20</span>
        </div>

        <div class="slider-row">
          <label class="slider-label" for="delay-word-probability">Delay</label>
          <input type="range" id="delay-word-probability" min="0" max="100" value="20" />
          <span id="delayValue-word-probability" class="slider-value">20</span>
        </div>

        <div class="controls">
          <button id="play-word-probability" class="control" aria-label="Play" title="Play">
            <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
          </button>
          <button id="pause-word-probability" class="control" aria-label="Pause" title="Pause">
            <span class="fa-solid fa-circle-pause" aria-hidden="true"></span>
          </button>
          <button id="clear-word-probability" class="control" aria-label="Clear" title="Clear">
            <span class="fa-solid fa-trash" aria-hidden="true"></span>
          </button>
        </div>
      </div>
    </div>

    <div class="demo-container-wrapper">
      <div class="demo-container">
        <p id="wordWithProbability" class="demo-text-probability-parent"></p>
      </div>
      <div class="demo-paused-overlay paused" id="overlay-word-probability">
        <button class="play-btn" aria-label="Play">
          <span class="fa-solid fa-circle-play" aria-hidden="true"></span>
        </button>
        <span class="paused-message">Animation paused on mobile. Tap play to start.</span>
      </div>
    </div>
  </div>
</div>

<hr />

<h3>Animating with promises</h3>

<p>
  This experiment started by creating a lightweight combinator abstracting over JavaScript's <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise">Promise constructor</a> (i.e. <code>new Promise((resolve, reject) => ...)</code>). My motivation was to both explore animating text streamed from an imaginary generative AI system, and learn about JavaScript promises. My goal for this collection of combinators was to provide a declarative combinator DSL, making it possible to separate the promise-based logic and management from the animation iteration.
</p>

<h3>Combinator building blocks</h3>

<p>
  At the heart of this experiment is a collection of combinators that abstract promise construction and resolution:
</p>

<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="c1">// Returns a promise that resolves after executing the given operation.</span>
<span class="kd">function</span> <span class="nx">promise</span><span class="p">(</span><span class="nx">op</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">return</span> <span class="k">new</span> <span class="nb">Promise</span><span class="p">((</span><span class="nx">resolve</span><span class="p">,</span> <span class="nx">reject</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">op</span><span class="p">(</span><span class="nx">resolve</span><span class="p">,</span> <span class="nx">reject</span><span class="p">))</span>
<span class="p">}</span>

<span class="c1">// Returns a function that resolves a promise with the given operation.</span>
<span class="kd">function</span> <span class="nx">resolveWith</span><span class="p">(</span><span class="nx">op</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">return</span> <span class="p">((</span><span class="nx">resolve</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">op</span><span class="p">(</span><span class="nx">resolve</span><span class="p">));</span>
<span class="p">}</span>

<span class="c1">// Returns a function that resolves a promise with the given operation,</span>
<span class="c1">// allowing the operation to be called recursively.</span>
<span class="kd">function</span> <span class="nx">resolveWithFix</span><span class="p">(</span><span class="nx">op</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">return</span> <span class="p">((</span><span class="nx">resolve</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">op</span><span class="p">(</span><span class="nx">op</span><span class="p">,</span> <span class="nx">resolve</span><span class="p">));</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>
  The <code>resolveWithFix</code> combinator is the key to enabling recursive operations: it passes the operation to itself as its first argument, so the operation can call itself again. This mirrors the idea behind fixed-point combinators like the <a href="https://en.wikipedia.org/wiki/Fixed-point_combinator#Y_combinator">Y-combinator</a>, which allow recursion without named functions. Here, the recursion runs over anonymous promise-based operations and continues until the recurring operation finally calls <code>resolve</code>. In this context, the <code>op</code> is one of the three presented iterators, and the combinator enables stateful iteration without imperative loops.
</p>
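
<p>
  To see the self-application on its own, here is a minimal sketch that is not part of the animation code. It repeats <code>promise</code> and <code>resolveWithFix</code> from above so it runs standalone, and it assumes a hypothetical <code>countdown</code> helper that I made up purely for illustration.
</p>

```javascript
// Combinators copied from the post so this sketch runs on its own.
function promise(op) {
  return new Promise((resolve, reject) => op(resolve, reject));
}

function resolveWithFix(op) {
  return (resolve) => op(op, resolve);
}

// Hypothetical helper (not from the post): builds an anonymous operation that
// counts down from `start`, recording every value it visits in `seen`.
function countdown(start, seen) {
  let remaining = start;
  return (self, resolve) => {
    seen.push(remaining);
    // Base case: settle the promise and stop recursing.
    if (remaining === 0) return resolve(seen);
    remaining -= 1;
    // Recursive step: the operation receives itself as `self`, so it can
    // recurse without a name. The post's iterators defer this call with
    // setTimeout; this sketch recurses synchronously to stay small.
    return self(self, resolve);
  };
}

const seen = [];
promise(resolveWithFix(countdown(3, seen)))
  .then((values) => console.log(values.join(" "))); // prints "3 2 1 0"
```

<p>
  The operation never refers to itself by name. It only ever calls the <code>self</code> argument it was handed, which is the same trick the iterators rely on.
</p>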

<p>
  The combinators alone are not sufficient to create the animations. We need to define how the operations will be executed and how they will interact with the DOM. There are many possible ways to implement this, but I've chosen to focus on recursive async iteration.
</p>

<h3>Recursive iterators for animation</h3>

<p>
I wanted to push promises all the way down the stack, so I created iterators that use the above combinators to manage animation state and DOM updates, where every iteration is the resolution of a single promise. The simplest of the iterators is the <code>charIterator</code>, which iterates over each character in a string, updating the opacity of each character over time. The <code>wordIterator</code> does the same for words, and the <code>wordWithProbabilityIterator</code> adds a probability value to each word, revealing it on hover. Below is the <code>charIterator</code> implementation:
</p>

<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="kd">function</span> <span class="nx">charIterator</span><span class="p">(</span><span class="nx">target</span><span class="p">,</span> <span class="nx">input</span><span class="p">,</span> <span class="nx">externalOpts</span> <span class="o">=</span> <span class="p">{})</span> <span class="p">{</span>
  <span class="c1">// Initialize and close over the iterator state.</span>
  <span class="kd">const</span> <span class="nx">opts</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">...</span><span class="nx">externalOpts</span><span class="p">,</span>
    <span class="na">index</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
    <span class="na">input</span><span class="p">:</span> <span class="nx">input</span><span class="p">,</span>
    <span class="na">queue</span><span class="p">:</span> <span class="p">[],</span>
    <span class="na">output</span><span class="p">:</span> <span class="dl">""</span><span class="p">,</span>
    <span class="na">outputWithOpacity</span><span class="p">:</span> <span class="dl">""</span><span class="p">,</span>
    <span class="na">initialOpacity</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="c1">// Initial opacity for characters.</span>
    <span class="na">steps</span><span class="p">:</span> <span class="nx">externalOpts</span><span class="p">.</span><span class="nx">steps</span> <span class="o">??</span> <span class="mi">20</span><span class="p">,</span> <span class="c1">// Number of steps for the animation. Increasing this will make the animation appear "smoother" and its duration longer.</span>
    <span class="na">delay</span><span class="p">:</span> <span class="nx">externalOpts</span><span class="p">.</span><span class="nx">delay</span> <span class="o">??</span> <span class="mi">20</span><span class="p">,</span> <span class="c1">// Delay applied to each recursive call in milliseconds. Increasing this will make the animation appear slower.</span>
    <span class="na">status</span><span class="p">:</span> <span class="dl">"</span><span class="s2">running</span><span class="dl">"</span><span class="p">,</span>
  <span class="p">};</span>
  <span class="kd">function</span> <span class="nx">reset</span><span class="p">()</span> <span class="p">{</span>
    <span class="nx">opts</span><span class="p">.</span><span class="nx">index</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="nx">opts</span><span class="p">.</span><span class="nx">queue</span> <span class="o">=</span> <span class="p">[];</span>
    <span class="nx">opts</span><span class="p">.</span><span class="nx">outputWithOpacity</span> <span class="o">=</span> <span class="dl">""</span><span class="p">;</span>
    <span class="nx">opts</span><span class="p">.</span><span class="nx">status</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">paused</span><span class="dl">"</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="kd">const</span> <span class="nx">domCtx</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="nx">target</span><span class="p">);</span>

  <span class="kd">const</span> <span class="nx">runner</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">op</span><span class="p">,</span> <span class="nx">next</span><span class="p">)</span> <span class="p">{</span>
    <span class="nx">promise</span><span class="p">(</span>
      <span class="nx">resolveWith</span><span class="p">((</span><span class="nx">next</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">setTimeout</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
          <span class="nx">promise</span><span class="p">(</span>
            <span class="nx">resolveWith</span><span class="p">((</span><span class="nx">next</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
              <span class="nx">opts</span><span class="p">.</span><span class="nx">output</span> <span class="o">=</span> <span class="dl">""</span><span class="p">;</span>
              <span class="k">if</span> <span class="p">(</span><span class="nx">opts</span><span class="p">.</span><span class="nx">index</span> <span class="o">&lt;</span> <span class="nx">opts</span><span class="p">.</span><span class="nx">input</span><span class="p">.</span><span class="nx">length</span><span class="p">)</span> <span class="p">{</span>
                <span class="kd">const</span> <span class="nx">char</span> <span class="o">=</span> <span class="nx">input</span><span class="p">[</span><span class="nx">opts</span><span class="p">.</span><span class="nx">index</span><span class="p">];</span>
                <span class="nx">opts</span><span class="p">.</span><span class="nx">queue</span><span class="p">.</span><span class="nx">push</span><span class="p">({</span> <span class="na">char</span><span class="p">:</span> <span class="nx">char</span><span class="p">,</span> <span class="na">opacity</span><span class="p">:</span> <span class="nx">opts</span><span class="p">.</span><span class="nx">initialOpacity</span> <span class="p">});</span>
              <span class="p">}</span>
              <span class="k">return</span> <span class="nx">next</span><span class="p">();</span>
            <span class="p">})</span>
          <span class="p">).</span><span class="nx">then</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
            <span class="nx">opts</span><span class="p">.</span><span class="nx">queue</span> <span class="o">=</span> <span class="nx">opts</span><span class="p">.</span><span class="nx">queue</span><span class="p">.</span><span class="nx">filter</span><span class="p">((</span><span class="nx">current</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
              <span class="k">if</span> <span class="p">(</span><span class="nx">current</span><span class="p">.</span><span class="nx">opacity</span> <span class="o">&gt;=</span> <span class="mi">100</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">if</span> <span class="p">(</span><span class="nx">current</span><span class="p">.</span><span class="nx">char</span> <span class="o">===</span> <span class="dl">'</span><span class="se">\n</span><span class="dl">'</span><span class="p">)</span> <span class="p">{</span>
                  <span class="nx">opts</span><span class="p">.</span><span class="nx">outputWithOpacity</span> <span class="o">+=</span> <span class="s2">`&lt;br&gt;`</span><span class="p">;</span>
                  <span class="k">return</span> <span class="kc">false</span><span class="p">;</span>
                <span class="p">}</span>

                <span class="nx">opts</span><span class="p">.</span><span class="nx">outputWithOpacity</span> <span class="o">+=</span> <span class="s2">`&lt;span class="set" style="opacity: </span><span class="p">${</span><span class="nx">current</span><span class="p">.</span><span class="nx">opacity</span><span class="p">}</span><span class="s2">%"&gt;</span><span class="p">${</span><span class="nx">current</span><span class="p">.</span><span class="nx">char</span><span class="p">}</span><span class="s2">&lt;/span&gt;`</span><span class="p">;</span>
                <span class="k">return</span> <span class="kc">false</span><span class="p">;</span>
              <span class="p">}</span>

              <span class="nx">current</span><span class="p">.</span><span class="nx">opacity</span> <span class="o">+=</span> <span class="p">(</span><span class="mi">100</span> <span class="o">/</span> <span class="nx">opts</span><span class="p">.</span><span class="nx">steps</span><span class="p">);</span>
              <span class="k">if</span> <span class="p">(</span><span class="nx">current</span><span class="p">.</span><span class="nx">char</span> <span class="o">===</span> <span class="dl">'</span><span class="se">\n</span><span class="dl">'</span><span class="p">)</span> <span class="p">{</span>
                <span class="nx">opts</span><span class="p">.</span><span class="nx">output</span> <span class="o">+=</span> <span class="s2">`&lt;br&gt;`</span><span class="p">;</span>
              <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                <span class="nx">opts</span><span class="p">.</span><span class="nx">output</span> <span class="o">+=</span> <span class="s2">`&lt;span style="opacity: </span><span class="p">${</span><span class="nx">current</span><span class="p">.</span><span class="nx">opacity</span><span class="p">}</span><span class="s2">%"&gt;</span><span class="p">${</span><span class="nx">current</span><span class="p">.</span><span class="nx">char</span><span class="p">}</span><span class="s2">&lt;/span&gt;`</span><span class="p">;</span>
              <span class="p">}</span>

              <span class="k">return</span> <span class="kc">true</span><span class="p">;</span>
            <span class="p">});</span>
          <span class="p">})</span>
          <span class="k">return</span> <span class="nx">next</span><span class="p">();</span>
        <span class="p">},</span> <span class="nx">opts</span><span class="p">.</span><span class="nx">delay</span><span class="p">);</span>
      <span class="p">})</span>
    <span class="p">).</span><span class="nx">then</span><span class="p">(()</span> <span class="o">=&gt;</span>
      <span class="nx">promise</span><span class="p">(</span>
          <span class="nx">resolveWith</span><span class="p">((</span><span class="nx">next</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
            <span class="nx">domCtx</span><span class="p">.</span><span class="nx">innerHTML</span> <span class="o">=</span> <span class="nx">opts</span><span class="p">.</span><span class="nx">outputWithOpacity</span> <span class="o">+</span> <span class="nx">opts</span><span class="p">.</span><span class="nx">output</span><span class="p">;</span>
            <span class="nx">opts</span><span class="p">.</span><span class="nx">index</span><span class="o">++</span><span class="p">;</span>
            <span class="k">return</span> <span class="nx">next</span><span class="p">();</span>
          <span class="p">})</span>
      <span class="p">)</span>
    <span class="p">).</span><span class="nx">then</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
      <span class="c1">// If the status is paused or canceled, we abort the recursion.</span>
      <span class="k">if</span> <span class="p">(</span><span class="nx">opts</span><span class="p">.</span><span class="nx">status</span> <span class="o">!==</span> <span class="dl">"</span><span class="s2">running</span><span class="dl">"</span><span class="p">)</span> <span class="k">return</span> <span class="nx">next</span><span class="p">();</span>

      <span class="c1">// If the index exceeds the input string length and the queue is empty, we can abort the recursion,</span>
      <span class="c1">// or we can reset the state of the iterator, and loop the animation endlessly.</span>
      <span class="k">if</span> <span class="p">(</span><span class="nx">opts</span><span class="p">.</span><span class="nx">index</span> <span class="o">&gt;=</span> <span class="nx">opts</span><span class="p">.</span><span class="nx">input</span><span class="p">.</span><span class="nx">length</span> <span class="o">&amp;&amp;</span> <span class="nx">opts</span><span class="p">.</span><span class="nx">queue</span><span class="p">.</span><span class="nx">length</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">//return next(); // This is where we would abort.</span>
        <span class="nx">reset</span><span class="p">();</span>
        <span class="nx">opts</span><span class="p">.</span><span class="nx">status</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">running</span><span class="dl">"</span><span class="p">;</span>
        <span class="nx">domCtx</span><span class="p">.</span><span class="nx">innerHTML</span> <span class="o">=</span> <span class="dl">""</span><span class="p">;</span>
      <span class="p">}</span>

      <span class="c1">// Otherwise, we use a Y-combinator to continue the recursion.</span>
      <span class="nx">op</span><span class="p">(</span><span class="nx">op</span><span class="p">,</span> <span class="nx">next</span><span class="p">,</span> <span class="nx">opts</span><span class="p">)</span>
    <span class="p">})</span>
  <span class="p">}</span>

  <span class="k">return</span> <span class="p">{</span> <span class="nx">runner</span><span class="p">,</span> <span class="nx">opts</span><span class="p">,</span> <span class="nx">reset</span> <span class="p">};</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>
  The <code>charIterator</code> is invoked using one last combinator named <code>animate</code>:
</p>

<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre><span class="c1">// Animate a target element using the provided iterator and input string.</span>
<span class="kd">function</span> <span class="nx">animate</span><span class="p">(</span><span class="nx">target</span><span class="p">,</span> <span class="nx">input</span><span class="p">,</span> <span class="nx">iterator</span><span class="p">,</span> <span class="nx">opts</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">return</span> <span class="nx">resolveWithFix</span><span class="p">(</span><span class="nx">iterator</span><span class="p">(</span><span class="nx">target</span><span class="p">,</span> <span class="nx">input</span><span class="p">,</span> <span class="nx">opts</span><span class="p">));</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>
What I find interesting about this implementation is that every iteration of the iterator is an async operation. This asynchrony affords very precise control over the animation, and the promise chain used by the iterators decomposes the animation sequence into separate promise resolutions. The first resolution handles a single iteration: managing the work queue and the state of its elements. The second updates the iterator's index position and the DOM. The third applies the recursive step, or aborts if the base cases are met. Although the iterator is essentially a sequential, imperative program, it is a good way to exercise asynchronous semantics and to separate concerns across distinct stages of the promise chain.
</p>

<h3>Why this matters</h3>

<p>
I'm a strong believer in curiosity-driven development and creative play. This experiment was both a challenging and fun dive into an unusual application of asynchronous programming, modeling iterator evaluation entirely through JavaScript promises. Surprisingly, the combinator-based approach to wrapping and resolving promises allowed for a declarative DSL for managing promises. Exploring the interplay between UI buttons as an asynchronous signal, and the interruption or resuming of a recursive iterator promise chain also revealed how asynchronous execution allows separating layers related to some computation into distinct stages or steps of an async workflow. I hope you found something here that sparked a new idea, or at least enjoyed playing with the animations.
</p>

<script type="module" src="/assets/javascripts/animating-generative-text-v2.js"></script>
]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Inject and Zip with the Y-Combinator</title>
      <link>https://rickwinfrey.com/writings/inject-and-zip-with-the-y-combinator</link>
      <guid isPermaLink="true">https://rickwinfrey.com/writings/inject-and-zip-with-the-y-combinator</guid>
      <pubDate>Fri, 20 Feb 2015 23:04:07 +0000</pubDate>
      
      <description>An exploration of how Ruby&apos;s &lt;code&gt;inject&lt;/code&gt; and &lt;code&gt;zip&lt;/code&gt; methods can be implemented using nothing but anonymous functions and the Y-combinator. A silly, elegant, mind-bending combinator journey that is slower than you&apos;d like.</description>
      
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <![CDATA[<p>
Lately I've wondered what it would be like to try to write some of Ruby's enumerable methods
using only lambdas. I previously had given a talk about the
<a href="http://en.wikipedia.org/wiki/Fixed-point_combinator#What_is_a_.22combinator.22.3F" target="_blank">
Y-combinator</a> and was already familiar with the notion of a fixed point combinator. I was
curious to see what it would take to implement Ruby's
<a href="http://ruby-doc.org/core-2.2.0/Enumerable.html#method-i-inject" target="_blank">inject</a>
method supporting the basic arithmetic operations and Ruby's
<a href="http://ruby-doc.org/core-2.2.0/Enumerable.html#method-i-zip" target="_blank">zip</a> method.
</p>

<p>
This post assumes a comfortable familiarity with the Y-combinator.
If you're new to the Y-combinator, I would suggest first viewing an inspiring and thorough
<a href="https://www.youtube.com/watch?v=FITJMJjASUs" target="_blank">talk</a>
given by the late Jim Weirich. The overview he provides of how the Y-combinator works, along with
the explanations about the underlying concepts from
<a href="http://en.wikipedia.org/wiki/Lambda_calculus" target="_blank">lambda calculus</a>
used by <a href="http://en.wikipedia.org/wiki/Haskell_Curry" target="_blank">Haskell Curry</a>
to formally define the Y-combinator will give you a solid foundation for understanding the concept.
In a nutshell, the Y-combinator lets us express recursion
using only anonymous functions.
</p>

<p>
Let's take a look at a classic application of the Y-combinator: calculating a factorial.
The result is stored in a local variable, but otherwise the recursion is
handled entirely by anonymous functions written with Ruby's lambda shorthand, the <code>-&gt;</code> operator.
</p>

<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre><span class="c1"># Find the factorial of 5</span>
<span class="mi">5</span> <span class="o">*</span> <span class="mi">4</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">1</span> <span class="c1"># =&gt; 120</span>

<span class="c1"># Y-combinator solution</span>
<span class="n">result</span> <span class="o">=</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">number</span><span class="p">)</span> <span class="p">{</span> <span class="n">builder</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">number</span><span class="p">)</span> <span class="p">}.</span><span class="nf">call</span><span class="p">(</span>
  <span class="o">-&gt;</span> <span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">number</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">number</span> <span class="o">==</span> <span class="mi">0</span>
    <span class="k">return</span> <span class="n">number</span> <span class="o">*</span> <span class="n">recurse</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">number</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
  <span class="p">},</span>
  <span class="mi">5</span>
<span class="p">)</span>

<span class="n">result</span> <span class="c1"># =&gt; 120</span>
</pre></td></tr></tbody></table></code></pre></figure>
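
<p>
For comparison, here is a minimal sketch of the curried, applicative-order form of the fixed-point combinator (often called the Z combinator). In the example above the lambda receives itself and has to pass itself along on every call; in this form the body simply receives "the recursive function". The names <code>fix</code>, <code>f</code>, <code>x</code>, and <code>v</code> are my own illustrative choices, and the two local variables are there only for readability.
</p>

```ruby
# Applicative-order fixed-point combinator (often called the Z combinator).
# The names `fix`, `f`, `x`, and `v` are illustrative, not from the original post.
fix = -> (f) {
  -> (x) { x.call(x) }.call(
    -> (x) { f.call(-> (v) { x.call(x).call(v) }) }
  )
}

# `f` receives "the recursive function itself" as `recurse`, so the body
# never has to pass itself along explicitly.
factorial = fix.call(
  -> (recurse) {
    -> (number) { number == 0 ? 1 : number * recurse.call(number - 1) }
  }
)

factorial.call(5) # => 120
```

<p>
The inner <code>-&gt; (v) { ... }</code> wrapper delays the self-application until it is needed, which is what keeps this from recursing forever in a strict language like Ruby.
</p>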

<p>
With the factorial example in hand, let's look at what
inject looks like under the same constraint of using only anonymous functions.
</p>


<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre><span class="c1"># Ruby inject example</span>
<span class="p">[</span><span class="o">*</span><span class="mi">0</span><span class="o">..</span><span class="mi">10</span><span class="p">].</span><span class="nf">inject</span><span class="p">(</span><span class="o">&amp;</span><span class="p">:</span><span class="o">+</span><span class="p">)</span> <span class="c1"># =&gt; 55</span>

<span class="c1"># Y-combinator solution</span>
<span class="n">result</span> <span class="o">=</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">limit</span><span class="p">,</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="n">builder</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">limit</span><span class="p">,</span> <span class="n">count</span><span class="p">)</span> <span class="p">}.</span><span class="nf">call</span><span class="p">(</span>
  <span class="o">-&gt;</span> <span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">limit</span><span class="p">,</span> <span class="n">count</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="mi">0</span> <span class="k">if</span> <span class="n">count</span> <span class="o">&gt;</span> <span class="n">limit</span>
    <span class="k">return</span> <span class="n">count</span> <span class="o">+</span> <span class="n">recurse</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">limit</span><span class="p">,</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
  <span class="p">},</span>
  <span class="mi">10</span>
<span class="p">)</span>

<span class="n">result</span> <span class="c1"># =&gt; 55</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>
This Y-combinator gives us the correct result, but the solution has limitations.
The first is that we are required to pass in a limit as an input parameter rather than
a collection or a range with arbitrary start and stop points. This solution
is also only capable of addition. Should we choose to implement inject-like functionality
using multiplication instead, we would have to duplicate most of this solution.
Let's make our Y-combinator solution a little more sophisticated, so that it can handle a
range or collection as an input parameter, and also support any of the basic arithmetic operations.
</p>


<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="code"><pre><span class="c1"># Ruby inject example</span>
<span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">].</span><span class="nf">inject</span><span class="p">(</span><span class="o">&amp;</span><span class="p">:</span><span class="o">*</span><span class="p">)</span> <span class="c1"># =&gt; 1000</span>

<span class="c1"># Y-combinator solution</span>
<span class="n">result</span> <span class="o">=</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">)</span> <span class="p">{</span> <span class="n">builder</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">)</span> <span class="p">}.</span><span class="nf">call</span><span class="p">(</span>
  <span class="o">-&gt;</span> <span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">,</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">count</span> <span class="o">==</span> <span class="n">range</span><span class="p">.</span><span class="nf">size</span> <span class="o">-</span> <span class="mi">1</span>
      <span class="k">return</span> <span class="n">range</span><span class="p">[</span><span class="n">count</span><span class="p">].</span><span class="nf">send</span><span class="p">(</span><span class="n">operator</span><span class="p">,</span> <span class="n">operator</span> <span class="o">==</span> <span class="p">:</span><span class="o">*</span> <span class="o">||</span> <span class="n">operator</span> <span class="o">==</span> <span class="ss">:/</span> <span class="p">?</span> <span class="mi">1</span> <span class="p">:</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">else</span>
      <span class="k">return</span> <span class="n">range</span><span class="p">[</span><span class="n">count</span><span class="p">].</span><span class="nf">send</span><span class="p">(</span><span class="n">operator</span><span class="p">,</span> <span class="n">recurse</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">,</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
    <span class="k">end</span>
  <span class="p">},</span>
  <span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span>
  <span class="p">:</span><span class="o">*</span>
<span class="p">)</span>

<span class="n">result</span> <span class="c1"># =&gt; 1000</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>
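The approach now takes a collection (for example, a range splatted into an array) and an operator as input parameters.
To support any arithmetic operation with one generic lambda, the base case applies the operator's
identity element: 1 for multiplication and division, 0 for addition and subtraction. Given Ruby's Smalltalk roots, we can send
the operator as a message to each of the elements in the provided collection. Note that the recursion
folds from the right, so for subtraction and division it agrees with inject only for two-element
collections; with three or more elements the grouping differs. With that caveat, I've included
examples of subtraction and division below.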
</p>

<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
</pre></td><td class="code"><pre><span class="c1"># Ruby inject example with subtraction</span>
<span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">].</span><span class="nf">inject</span><span class="p">(</span><span class="o">&amp;</span><span class="p">:</span><span class="o">-</span><span class="p">)</span> <span class="c1"># =&gt; 90</span>

<span class="c1"># Y-combinator solution</span>
<span class="n">result</span> <span class="o">=</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">)</span> <span class="p">{</span> <span class="n">builder</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">)</span> <span class="p">}.</span><span class="nf">call</span><span class="p">(</span>
  <span class="o">-&gt;</span> <span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">,</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">count</span> <span class="o">==</span> <span class="n">range</span><span class="p">.</span><span class="nf">size</span> <span class="o">-</span> <span class="mi">1</span>
      <span class="k">return</span> <span class="n">range</span><span class="p">[</span><span class="n">count</span><span class="p">].</span><span class="nf">send</span><span class="p">(</span><span class="n">operator</span><span class="p">,</span> <span class="n">operator</span> <span class="o">==</span> <span class="p">:</span><span class="o">*</span> <span class="o">||</span> <span class="n">operator</span> <span class="o">==</span> <span class="ss">:/</span> <span class="p">?</span> <span class="mi">1</span> <span class="p">:</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">else</span>
      <span class="k">return</span> <span class="n">range</span><span class="p">[</span><span class="n">count</span><span class="p">].</span><span class="nf">send</span><span class="p">(</span><span class="n">operator</span><span class="p">,</span> <span class="n">recurse</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">,</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
    <span class="k">end</span>
  <span class="p">},</span>
  <span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span>
  <span class="p">:</span><span class="o">-</span>
<span class="p">)</span>

<span class="n">result</span> <span class="c1"># =&gt; 90</span>

<span class="c1"># Ruby inject example with division</span>
<span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">].</span><span class="nf">inject</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:/</span><span class="p">)</span> <span class="c1"># =&gt; 10</span>

<span class="c1"># Y-combinator solution</span>
<span class="n">result</span> <span class="o">=</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">)</span> <span class="p">{</span> <span class="n">builder</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">)</span> <span class="p">}.</span><span class="nf">call</span><span class="p">(</span>
  <span class="o">-&gt;</span> <span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">,</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">count</span> <span class="o">==</span> <span class="n">range</span><span class="p">.</span><span class="nf">size</span> <span class="o">-</span> <span class="mi">1</span>
      <span class="k">return</span> <span class="n">range</span><span class="p">[</span><span class="n">count</span><span class="p">].</span><span class="nf">send</span><span class="p">(</span><span class="n">operator</span><span class="p">,</span> <span class="n">operator</span> <span class="o">==</span> <span class="p">:</span><span class="o">*</span> <span class="o">||</span> <span class="n">operator</span> <span class="o">==</span> <span class="ss">:/</span> <span class="p">?</span> <span class="mi">1</span> <span class="p">:</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">else</span>
      <span class="k">return</span> <span class="n">range</span><span class="p">[</span><span class="n">count</span><span class="p">].</span><span class="nf">send</span><span class="p">(</span><span class="n">operator</span><span class="p">,</span> <span class="n">recurse</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">range</span><span class="p">,</span> <span class="n">operator</span><span class="p">,</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
    <span class="k">end</span>
  <span class="p">},</span>
  <span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span>
  <span class="ss">:/</span>
<span class="p">)</span>

<span class="n">result</span> <span class="c1"># =&gt; 10</span>
</pre></td></tr></tbody></table></code></pre></figure>
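
<p>
One thing worth noting about the lambdas above: they combine elements from the right, so <code>[100, 10, 5]</code> with subtraction would evaluate as <code>100 - (10 - (5 - 0))</code>, which is 95, whereas inject evaluates <code>(100 - 10) - 5</code>, which is 85. The following is a sketch of a left-folding variant that carries the running result forward instead. The <code>memo</code> parameter and the starting <code>count = 1</code> are assumptions of this sketch, not part of the original solution.
</p>

```ruby
# Left-folding variant: `memo` carries the running result, so the operator is
# applied in the same order as Enumerable#inject.
# `memo` and the starting `count = 1` are assumptions of this sketch.
result = -> (builder, range, operator) { builder.call(builder, range, operator) }.call(
  -> (recurse, range, operator, count = 1, memo = range[0]) {
    # Every element has been folded into memo, so we are done.
    return memo if count >= range.size
    recurse.call(recurse, range, operator, count + 1, memo.send(operator, range[count]))
  },
  [100, 10, 5],
  :-
)

result # => 85, the same as [100, 10, 5].inject(:-)
```

<p>
Because the first element seeds <code>memo</code>, this version no longer needs the identity elements, at the cost of threading one more argument through every recursive call.
</p>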

<p>
Now let's take a look at Ruby's zip method. This one is a little more interesting because
it involves two collections.
</p>

<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre><span class="c1"># Ruby zip example</span>
<span class="p">[</span><span class="o">*</span><span class="mi">0</span><span class="o">..</span><span class="mi">5</span><span class="p">].</span><span class="nf">zip</span><span class="p">([</span><span class="o">*</span><span class="mi">100</span><span class="o">..</span><span class="mi">105</span><span class="p">])</span> <span class="c1"># =&gt; [[0, 100], [1, 101], [2, 102], [3, 103], [4, 104], [5, 105]]</span>

<span class="n">result</span> <span class="o">=</span> <span class="o">-&gt;</span> <span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">collection1</span><span class="p">,</span> <span class="n">collection2</span><span class="p">,</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="n">builder</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">collection1</span><span class="p">,</span> <span class="n">collection2</span><span class="p">,</span> <span class="n">count</span><span class="p">)</span> <span class="p">}.</span><span class="nf">call</span><span class="p">(</span>
  <span class="o">-&gt;</span> <span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">collection1</span><span class="p">,</span> <span class="n">collection2</span><span class="p">,</span> <span class="n">count</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="p">[]</span> <span class="k">if</span> <span class="n">count</span> <span class="o">==</span> <span class="n">collection1</span><span class="p">.</span><span class="nf">count</span>
    <span class="k">return</span> <span class="n">recurse</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">collection1</span><span class="p">,</span> <span class="n">collection2</span><span class="p">,</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span><span class="p">).</span><span class="nf">push</span><span class="p">([</span><span class="n">collection1</span><span class="p">[</span><span class="n">collection1</span><span class="p">.</span><span class="nf">count</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">count</span><span class="p">],</span> <span class="n">collection2</span><span class="p">[</span><span class="n">collection1</span><span class="p">.</span><span class="nf">count</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">count</span><span class="p">]])</span>
  <span class="p">},</span>
  <span class="p">[</span><span class="o">*</span><span class="mi">0</span><span class="o">..</span><span class="mi">5</span><span class="p">],</span>
  <span class="p">[</span><span class="o">*</span><span class="mi">100</span><span class="o">..</span><span class="mi">105</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">result</span> <span class="c1"># =&gt; [[0, 100], [1, 101], [2, 102], [3, 103], [4, 104], [5, 105]]</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>
I initially ran into a problem implementing zip as a Y-combinator: Ruby does not have a
cons function like the one found in Lisp or Scheme. I could have created an accumulator
variable and passed it along through each recursive call, but mutating that accumulator
felt troubling. Instead, the base case of the recursive function
returns an empty array, onto which the zipped tuples of both collections are pushed as the recursion unwinds.
</p>
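
<p>
As an aside, <code>push</code> still mutates the array handed back by the deeper call. A cons-like alternative, sketched below, is to build a new array at every step with <code>Array#+</code>, placing the current pair in front of the result of the recursive call. This is my own variation rather than the solution used in this post; it also lets the lambda index from the front of both collections.
</p>

```ruby
# Non-mutating zip sketch: Array#+ returns a new array, so no array is
# modified in place. This is a variation on the solution above, not the
# version that was benchmarked.
result = -> (builder, collection1, collection2) { builder.call(builder, collection1, collection2, 0) }.call(
  -> (recurse, collection1, collection2, count) {
    return [] if count == collection1.count
    # Prepend the current pair, cons-style, to the zipped remainder.
    [[collection1[count], collection2[count]]] + recurse.call(recurse, collection1, collection2, count + 1)
  },
  [*0..5],
  [*100..105]
)

result # => [[0, 100], [1, 101], [2, 102], [3, 103], [4, 104], [5, 105]]
```

<p>
The trade-off is allocation: every step creates a fresh array, so this is likely to be slower still than the <code>push</code> version.
</p>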

<p>
Just for fun, I benchmarked the Y-combinator zip solution against Ruby's
zip method. The results are unsurprising: Ruby's native zip method is far better optimized
than recursive lambda calls.
</p>

<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
</pre></td><td class="code"><pre><span class="nb">require</span> <span class="s1">'benchmark'</span>

<span class="n">native_result</span> <span class="o">=</span> <span class="no">Benchmark</span><span class="p">.</span><span class="nf">measure</span> <span class="k">do</span>
  <span class="mi">10000</span><span class="p">.</span><span class="nf">times</span> <span class="k">do</span>
    <span class="p">[</span><span class="o">*</span><span class="mi">0</span><span class="o">..</span><span class="mi">100</span><span class="p">].</span><span class="nf">zip</span><span class="p">([</span><span class="o">*</span><span class="mi">0</span><span class="o">..</span><span class="mi">100</span><span class="p">])</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="n">native_result</span><span class="p">.</span><span class="nf">real</span> <span class="c1"># =&gt; 0.177837</span>

<span class="n">y_combinator_result</span> <span class="o">=</span> <span class="no">Benchmark</span><span class="p">.</span><span class="nf">measure</span> <span class="k">do</span>
  <span class="mi">10000</span><span class="p">.</span><span class="nf">times</span> <span class="k">do</span>
    <span class="o">-&gt;</span> <span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">collection1</span><span class="p">,</span> <span class="n">collection2</span><span class="p">,</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="n">builder</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">builder</span><span class="p">,</span> <span class="n">collection1</span><span class="p">,</span> <span class="n">collection2</span><span class="p">,</span> <span class="n">count</span><span class="p">)</span> <span class="p">}.</span><span class="nf">call</span><span class="p">(</span>
      <span class="o">-&gt;</span> <span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">collection1</span><span class="p">,</span> <span class="n">collection2</span><span class="p">,</span> <span class="n">count</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="p">[]</span> <span class="k">if</span> <span class="n">count</span> <span class="o">==</span> <span class="n">collection1</span><span class="p">.</span><span class="nf">count</span>
        <span class="k">return</span> <span class="n">recurse</span><span class="p">.</span><span class="nf">call</span><span class="p">(</span><span class="n">recurse</span><span class="p">,</span> <span class="n">collection1</span><span class="p">,</span> <span class="n">collection2</span><span class="p">,</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span><span class="p">).</span><span class="nf">push</span><span class="p">([</span><span class="n">collection1</span><span class="p">[</span><span class="n">collection1</span><span class="p">.</span><span class="nf">count</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">count</span><span class="p">],</span> <span class="n">collection2</span><span class="p">[</span><span class="n">collection1</span><span class="p">.</span><span class="nf">count</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">count</span><span class="p">]])</span>
      <span class="p">},</span>
      <span class="p">[</span><span class="o">*</span><span class="mi">0</span><span class="o">..</span><span class="mi">100</span><span class="p">],</span>
      <span class="p">[</span><span class="o">*</span><span class="mi">0</span><span class="o">..</span><span class="mi">100</span><span class="p">]</span>
    <span class="p">)</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="n">y_combinator_result</span><span class="p">.</span><span class="nf">real</span> <span class="c1"># =&gt; 0.61685</span>

<span class="p">(</span><span class="n">native_result</span><span class="p">.</span><span class="nf">real</span> <span class="o">-</span> <span class="n">y_combinator_result</span><span class="p">.</span><span class="nf">real</span><span class="p">)</span> <span class="o">/</span> <span class="n">native_result</span><span class="p">.</span><span class="nf">real</span><span class="p">.</span><span class="nf">to_f</span> <span class="o">*</span> <span class="mi">100</span> <span class="c1"># =&gt; -246%</span>
</pre></td></tr></tbody></table></code></pre></figure>
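
<p>
The percentage on the last line of the benchmark can be easier to read as a ratio. The short sketch below reuses the two wall-clock timings reported above; the variable names are my own, and the numbers will of course differ from run to run and machine to machine.
</p>

```ruby
# Wall-clock timings copied from the benchmark run above (seconds).
native_seconds       = 0.177837
y_combinator_seconds = 0.61685

# How many times longer the Y-combinator version took.
slowdown = y_combinator_seconds / native_seconds
slowdown.round(2) # => 3.47

# The same comparison as a percentage increase over the native time.
((slowdown - 1) * 100).round(1) # => 246.9
```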

<p>
Although the Y-combinator zip took roughly 3.5 times as long as Ruby's native zip method in this run
(about 247% more wall-clock time), implementing two of Ruby's enumerable methods using Y-combinators was good fun.
This post was inspired by Tom Stuart's amazing <a href="http://codon.com/programming-with-nothing" target="_blank">Programming with Nothing</a>,
which is also available as a <a href="http://rubymanor.org/3/videos/programming_with_nothing/" target="_blank">talk</a>.
</p>
]]>
      </content:encoded>
    </item>
    
  </channel>
</rss>
