In two previous blog entries I explored EMC's late entry into the deduplication market with the EMC Celerra Data Deduplication product.
I used the contents of my laptop to demonstrate that getting to 35% savings was a bit of a squeeze using EMC's technique for deduplicating. Perhaps it wasn't a representative sample of the kind of data you might typically see in home directories, and the results might be better with real-life user data.
Nothing like real life data for making a point.
So here are some real numbers from a real user. I've used anonymous customer data before, in a usable space analysis. Using some advanced technology (in this case, NetApp's Information Server 1200) I can make a much more accurate assessment of the capabilities of EMC's latest but not so greatest. In fact, for customers looking to see how much EMC deduplication can't save them, I can strongly recommend its analytic capabilities.
The conclusion? 30% deduplication on the Celerra, with some generous assumptions. See below the line for the gory details.
The real figures will be much worse, and look out for all these other caveats.
EMC dedupe:
- is for Celerra only
- disables MPFS
- does not support VMware (VMDKs)
- does not support LUNs (iSCSI or FC SAN)
- is only done at a file level
- is targeted at only infrequently accessed (inactive) files
- only works on file sizes less than 200MB
- compression only works on file sizes greater than 24K
- has a performance impact in reading deduped and compressed files
- restricts backups and restores to full volumes
- means file level restores have to use SavVols and CoFW (copy-on-first-write) snapshots that impact performance
What will it look like on your data? 20% space savings? Less? The EMC calculator doesn't work on actual data, just %age estimates by file type. Give it a try.
NetApp provides two calculators. A deduplication calculator, like the EMC tool (except that NetApp's deduplication works across all primary data, not just NAS files), and SSET, a space saving estimation tool that uses actual data on your system. Actual results are within +/-5% of the space savings that SSET predicts.
It still looks for all the world to me like EMC are squeezing their customers, not your data. And, missing in action: V-Max deduplication?
The Details
EMC's "Deduplication"
Let's recap. For home directories, EMC's deduplication applies the following rules to "cold" (infrequently accessed) files that pass filtering:
- If the file is less than 24K, don't compress, only deduplicate
- If the file is >200MB, don't compress or deduplicate; leave it alone
- Otherwise, compress and deduplicate
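The three rules above can be sketched as a small decision function. This is only an illustration of the policy as described in this post, not EMC's implementation: the function name `celerra_actions` is hypothetical, and how the boundaries at exactly 24K and 200MB are treated is an assumption.

```python
KB = 1024
MB = 1024 * KB

# Thresholds as quoted in the post. Whether the limits are inclusive
# or exclusive is an assumption made for this sketch.
COMPRESS_MIN_BYTES = 24 * KB
MAX_BYTES = 200 * MB


def celerra_actions(size_bytes, is_cold):
    """Return the set of space-saving actions applied to one file.

    Hypothetical helper: models only the size and "cold" rules above.
    """
    if not is_cold:
        return set()               # active files are not touched
    if size_bytes > MAX_BYTES:
        return set()               # too large: left alone
    if size_bytes < COMPRESS_MIN_BYTES:
        return {"deduplicate"}     # too small to compress
    return {"compress", "deduplicate"}


if __name__ == "__main__":
    for size in (10 * KB, 5 * MB, 300 * MB):
        print(size, sorted(celerra_actions(size, is_cold=True)))
```

Run against a file listing, a function like this shows how much of the data never qualifies for any treatment at all.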
For "cold", I'm using a figure of 90% of the data; and I'm also assuming that all files between 0K and 250MB are compressible by 50%. That will make the EMC numbers much better than are achievable in practice. In reality, YMMV. You might see a smaller or larger %age of files as cold, and for certain, not all files will compress by 50%; many will not compress at all.
But as you'll see, my original 35% was generous.
Large Files Still Dominate
As I noted before, large files may be few in number, but they represent the bulk of the data. Here's a sample from a user system. This file system contains 20 million files in just under 13TB; it's CIFS home directory data. Small files make up the bulk of the raw file count, and the large files in excess of 250MB are very few in number (less than 0.03% of the total file count).
But those large files over 250MB make up over 41% of the used space.
That leaves 59% of the data for EMC to deduplicate (well, compress first and deduplicate second).
Duplicates
Duplicate files are interesting. There are very few duplicates among files in excess of 50MB, and deduplicating those saves only about 1% of their space. So not deduplicating them has very little effect.
Files smaller than 50MB show an average of 13% duplicate space savings, although over 30% of files are duplicates by count. Here it's the medium-sized files that make up most of the duplicates; small files are rarely duplicates. I'll err on the side of safety and use 10%.
At least EMC are deduplicating the right files!
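The gap between "30% of files by count" and "13% of space" is easy to reproduce. Here is a minimal sketch, assuming you already have a content hash and a size for each file; the function name `duplicate_stats` and the sample records are made up for illustration and are not the customer data. It also assumes that every copy beyond the first in a group counts as a duplicate.

```python
from collections import defaultdict


def duplicate_stats(files):
    """files: iterable of (content_hash, size_bytes) pairs.

    Returns (duplicate fraction by file count, duplicate fraction by space).
    """
    groups = defaultdict(list)
    for digest, size in files:
        groups[digest].append(size)
    if not groups:
        return 0.0, 0.0

    total_count = sum(len(sizes) for sizes in groups.values())
    total_bytes = sum(sum(sizes) for sizes in groups.values())
    # File-level deduplication keeps one copy per group and drops the rest.
    dup_count = sum(len(sizes) - 1 for sizes in groups.values())
    dup_bytes = sum(sum(sizes) - sizes[0] for sizes in groups.values())
    return dup_count / total_count, dup_bytes / total_bytes


if __name__ == "__main__":
    # Made-up sample: three copies of a small file plus two large unique files.
    sample = [("a", 10), ("a", 10), ("a", 10), ("b", 100), ("c", 200)]
    by_count, by_space = duplicate_stats(sample)
    print(f"duplicates by count: {by_count:.0%}, by space: {by_space:.0%}")
```

When the duplicates sit among the smaller files, the count figure always looks far better than the space figure, and space is the only one that matters for savings.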
Active Files
In terms of active files, 35% of the space was made up of active PSTs. EMC's deduplication doesn't do active files, and again, that's a big chunk of data we can't address. But most of them are in excess of 250MB and are in the "large files" category, so to remove them would be to double count, so I haven't.
The Results
Now we have some raw figures, let's go compress and deduplicate.
Taking our 13TB file system, we have
- 13TB total, but only 60% or 7.8TB is addressable files <250MB
- 90% of the 7.8TB is cold, giving 7.0TB to compress and deduplicate
- With 50% compression, 7.0TB compresses to 3.5TB
- 3.5TB file-level deduplicates by 13% to 3.1TB
- There's a total saving of 7.0TB - 3.1TB, or 3.9TB
- Saving 3.9TB out of 13TB is a 30% saving
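For anyone who wants to plug in their own numbers, the steps above can be sketched in a few lines. The function name `estimate_saving` and its parameter names are hypothetical. The duplicate figure is left as a parameter because this post quotes both a measured 13% and a more cautious 10%.

```python
def estimate_saving(total_tb, addressable_frac, cold_frac,
                    compress_ratio, dedupe_frac):
    """Return (TB saved, saving as a fraction of total_tb).

    Follows the steps in the list above: take the addressable share,
    keep the cold part, compress it, then file-level deduplicate.
    """
    addressable = total_tb * addressable_frac
    cold = addressable * cold_frac
    compressed = cold * (1 - compress_ratio)
    deduped = compressed * (1 - dedupe_frac)
    saved = cold - deduped
    return saved, saved / total_tb


if __name__ == "__main__":
    for dedupe in (0.10, 0.13):
        saved_tb, frac = estimate_saving(
            total_tb=13.0,
            addressable_frac=0.60,   # files <250MB
            cold_frac=0.90,          # generous assumption
            compress_ratio=0.50,     # generous assumption
            dedupe_frac=dedupe,
        )
        print(f"dedupe {dedupe:.0%}: saves {saved_tb:.1f}TB, "
              f"{frac:.1%} of 13TB")
```

With either duplicate figure the result lands at roughly 30%, so the headline number is not sensitive to that choice. It is far more sensitive to the 50% compression and 90% cold assumptions.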
A 30% Saving Based On Real Data
Here were my working figures and assumptions, and I was generous.
- For files <250MB:
  - 90% of your data is cold
  - 13% of your data is duplicated
  - All files compress by 50%
For comparison, these are the numbers EMC provide in their whitepaper (Achieving Storage Efficiency through EMC Celerra Data Deduplication, Jan 2009):
- For files <200MB:
  - 10% of your data is duplicated
  - Files that compress do so between 40-50%
