Paper: A Study of Practical Deduplication

With BigData come BigStorage costs. One way to store less is simply not to store the same data twice. That's the radically simple and powerful notion behind data deduplication. If you are one of those who got a good laugh out of the idea of eliminating SQL queries as a rather obvious scalability strategy, you'll love this one too, but it is a powerful feature and one I don't hear talked about much outside the enterprise. A parallel idea in programming is the once-and-only-once principle of never duplicating code.

Using deduplication technology, in exchange for some upfront CPU usage, which is a plentiful resource in many systems that are I/O bound anyway, it's possible to reduce storage requirements by up to 20:1, depending on your data, which saves both money and disk write overhead.

This comes up because of a really good article Robin Harris of StorageMojo wrote, All de-dup works, on the paper A Study of Practical Deduplication by Dutch Meyer and William Bolosky.

For a great explanation of deduplication we turn to Jeff Bonwick and his experience with the ZFS filesystem:

Deduplication is the process of eliminating duplicate copies of data. Dedup is generally either file-level, block-level, or byte-level. Chunks of data -- files, blocks, or byte ranges -- are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77 or, in more familiar notation, 0.00000000000000000000000000000000000000000000000000000000000000000000000000001. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.
Chunks of data are remembered in a table of some sort that maps the data's checksum to its storage location and reference count. When you store another copy of existing data, instead of allocating new space on disk, the dedup code just increments the reference count on the existing data. When data is highly replicated, which is typical of backup servers, virtual machine images, and source code repositories, deduplication can reduce space consumption not just by percentages, but by multiples.
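
To make that bookkeeping concrete, here is a minimal sketch of the table Bonwick describes: an in-memory Python dict keyed by SHA-256 digest, holding each chunk once along with a reference count. It is purely illustrative (the class name, the fixed 4 KB chunking, and the in-memory storage are my own assumptions), not how ZFS actually implements its on-disk dedup table.

```python
import hashlib

class DedupStore:
    """Toy block-level dedup store: a dict maps each chunk's SHA-256
    digest to its stored bytes and a reference count."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.table = {}   # digest -> {"data": bytes, "refs": int}

    def write(self, payload: bytes) -> list:
        """Split payload into fixed-size chunks; store each distinct chunk once."""
        digests = []
        for i in range(0, len(payload), self.chunk_size):
            chunk = payload[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            entry = self.table.get(digest)
            if entry is None:
                # New data: allocate space (here, just keep the bytes).
                self.table[digest] = {"data": chunk, "refs": 1}
            else:
                # Duplicate data: bump the reference count, write nothing.
                entry["refs"] += 1
            digests.append(digest)
        return digests   # the "recipe" needed to reassemble the payload

    def read(self, digests: list) -> bytes:
        """Reassemble a payload from its chunk digests."""
        return b"".join(self.table[d]["data"] for d in digests)


store = DedupStore()
recipe_a = store.write(b"A" * 8192)   # two identical 4 KB chunks...
recipe_b = store.write(b"A" * 8192)   # ...written twice
assert store.read(recipe_a) == b"A" * 8192
print(len(store.table))               # 1 unique chunk stored, not 4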

What the paper is saying, at least for their type of data, is that dealing simply with whole files is nearly as effective as more complex block-level deduplication schemes. From the paper:

We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of redundancy. We found that whole-file deduplication achieves about three quarters of the space savings of the most aggressive block-level deduplication for storage of live file systems, and 87% of the savings for backup images. We also studied file fragmentation finding that it is not prevalent, and updated prior file system metadata studies, finding that the distribution of file sizes continues to skew toward very large unstructured files.
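
To see what whole-file versus block-level elimination of redundancy means in practice, here is a rough sketch that counts the bytes each strategy would avoid writing for a set of in-memory files. The 4 KB fixed-size blocks and the function names are my own simplifying assumptions, not the paper's methodology; the paper's most aggressive block-level schemes also use variable-size, content-defined chunks.

```python
import hashlib

def whole_file_savings(files):
    """Bytes saved if each distinct file is stored only once."""
    seen, saved = set(), 0
    for data in files:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            saved += len(data)        # entire duplicate file is elided
        else:
            seen.add(digest)
    return saved

def block_level_savings(files, block_size=4096):
    """Bytes saved if every distinct fixed-size block is stored only once."""
    seen, saved = set(), 0
    for data in files:
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest in seen:
                saved += len(block)   # duplicate block elided, even when it
            else:                     # sits inside otherwise-unique files
                seen.add(digest)
    return saved

# Two files share a 4 KB prefix; only block-level dedup can exploit it.
files = [b"X" * 4096 + b"unique-tail-1", b"X" * 4096 + b"unique-tail-2"]
print(whole_file_savings(files), block_level_savings(files))  # prints: 0 4096
```

The gap between the two numbers is exactly what the study measures at scale: whole-file dedup misses redundancy hidden inside files that differ anywhere, yet on real desktop data it still captured most of the savings.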

Indirection and hashing are still the greatest weapons of computer science, which leads to some telling caveats from Robin:

  • De-duplication trades direct access for data capacity savings. This adds I/O latency and CPU cycles, as fingerprints must be calculated and compared, and reads effectively become random.

If you are dealing with more data than you can afford, the design tradeoffs might be worth it.

Where this code should live is an interesting question. If it's at the storage layer, in the cloud, that buys you nothing, because then all the cost savings simply go to the cloud provider: you will be billed for the size of the data you want to store, not the data actually stored. And all these calculations must be applied to non-encrypted data, so the implication is it should be in user space somewhere. Column store databases already do this sort of thing, but it would be useful if it were more general. Why should source control systems, for example, have to store diffs anymore? Why shouldn't they just store entire files and let the underlying file system or storage device figure out what's common? Why should email systems and group discussion lists implement their own compression? With the mass of unstructured data rising, it seems like there's an opportunity here.