ICYMI: Kannan Muthukkaruppan has written a great blog post on Adaptive vs. Inline/Always-on Deduplication. I thought I would share the first half here and link to the rest of the post.
Deduplication in storage is a data reduction technique that eliminates redundant copies of bit sequences. Instead of storing multiple copies of the same chunk of data, references to a single copy are stored. The process of finding duplicates in a large address space incurs performance and resource overheads. Consequently, deduplication design involves many tradeoffs in practice.
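To make the mechanism concrete, here is a minimal sketch (my own illustration, not code from the post) of a content-addressed store: data is split into fixed-size chunks, each unique chunk is stored exactly once, and writes return a list of references (hashes) instead of duplicate copies.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: one stored copy per unique chunk."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # hash -> chunk bytes (the single stored copy)

    def write(self, data: bytes) -> list:
        """Split data into fixed-size chunks; store each unique chunk once."""
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)  # only stores if chunk is new
            refs.append(h)                    # reference replaces the copy
        return refs

    def read(self, refs: list) -> bytes:
        """Reassemble data by following the references."""
        return b"".join(self.chunks[h] for h in refs)
```

Writing the same chunk repeatedly grows the reference list but not the chunk store, which is exactly the data reduction the post describes.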
For instance, deduplication at a fine-grained (small) chunk size improves savings because of the higher likelihood of finding duplicates, but these savings come at the expense of increased metadata and system resource consumption.
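The chunk-size tradeoff can be demonstrated with a small helper (again an illustration of mine, not from the post) that reports how many bytes of unique data must be stored versus how many metadata entries (chunk references) the system must track:

```python
import hashlib

def dedup_stats(data: bytes, chunk_size: int):
    """Return (unique_bytes_stored, metadata_entries) for fixed-size chunking."""
    unique = {}   # hash -> length of the unique chunk
    entries = 0   # one metadata reference per chunk written
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        unique[hashlib.sha256(chunk).hexdigest()] = len(chunk)
        entries += 1
    return sum(unique.values()), entries

# A 13-byte pattern repeated 1000 times (13,000 bytes total):
data = b"hello world! " * 1000

small = dedup_stats(data, 13)    # chunk size matches the pattern
large = dedup_stats(data, 4096)  # chunk size misaligned with the pattern
```

With 13-byte chunks, every chunk is a duplicate of the first, so only 13 bytes are stored, but at the cost of 1,000 metadata entries. With 4,096-byte chunks, no two chunks line up identically, so all 13,000 bytes are stored with only 4 entries: smaller chunks find more duplicates, larger chunks need less metadata.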
Another consideration involves the choice between inline and post-process deduplication. Inline deduplication occurs in real time while data is written to the system, resulting in all data being deduplicated in the storage system; post-process deduplication, by contrast, scans and deduplicates data in the background after it has already been written. “Inline & Always-on” deduplication is marketed well these days, but one must consider the full impact of this choice. Since inline deduplication occurs in the critical path of writes, it imposes a performance overhead on every write. For workloads that do not deduplicate well, an inline & always-on approach to deduplication can actually hurt performance rather than provide any benefit.
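The difference in where the work happens can be sketched as two toy write paths (hypothetical classes for illustration, not from the post): the inline writer pays the hash-and-lookup cost on every write, while the post-process writer accepts writes at full speed and defers deduplication to a background pass.

```python
import hashlib

class InlineWriter:
    """Dedup work happens on the write path, for every write."""

    def __init__(self):
        self.index = {}  # hash -> block

    def write(self, block: bytes) -> str:
        h = hashlib.sha256(block).hexdigest()  # hashing cost paid inline
        self.index.setdefault(h, block)        # duplicate never hits storage
        return h

class PostProcessWriter:
    """Writes land untouched; dedup runs later in the background."""

    def __init__(self):
        self.log = []  # raw blocks, duplicates and all

    def write(self, block: bytes) -> int:
        self.log.append(block)  # no dedup cost in the critical path
        return len(self.log) - 1

    def dedup_pass(self) -> dict:
        """Background pass that finds duplicates after the fact."""
        index = {}
        for block in self.log:
            index.setdefault(hashlib.sha256(block).hexdigest(), block)
        return index
```

For a workload that deduplicates poorly, every block is unique, so the inline writer pays the per-write overhead and gets nothing back, which is the downside the post warns about.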
Read More Here