I ran across this EMC dedupe calculator today. Interested in the space savings EMC was advertising for primary storage with dedupe, I thought I'd take a closer look. I was surprised by the results. The calculator asks you to input the "Size of Current Data" for seven different application categories. So I ran the calculator 7 times, each time entering 10TB into just one of the 7 categories and leaving the others empty. Here are the results I got:
EMC Celerra Data Duplication Calculator (Data Duplication? I sure hope that’s a typo)
Potential Savings:
Active virtual machine files - 0TB (0%) saved
Active databases and messaging files - 0TB (0%) saved
Compressed files - 0.2TB (2%) saved
Media files - 0.4TB (4%) saved
Binaries - 1.7TB (17%) saved
Office documents - 3.3TB (33%) saved
Text files - 5.3TB (53%) saved
Wait a minute, this can't be right. 4 of the 7 categories offer between 0% and 4% savings. Doesn't seem like it's worth the effort. Well, how about Binaries at 17% savings? But EMC only dedupes "inactive" files - which of my binaries are inactive? Not sure, so I think I'll skip that one. Office documents - bingo, 33% saved - but again this only applies to inactive docs, so I guess my savings will be a little less. Finally, there are text files with a whopping 53% savings - but I probably don't have 10TB in inactive .txt files. At this point, I am a little disappointed in EMC.
The way I see it, using Celerra dedupe in a real world scenario would produce maybe 5% overall savings on a good day. The problem, you see, is that the majority of user data is tied up in those darned applications that Celerra dedupe can't touch. The puzzling part is why EMC would include applications with zero percent savings in their calculator. Trying to deceive users? Nah, there must be some other explanation. Guess I'll leave that one up to them to answer.
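To put a rough number on that kind of estimate, here is a minimal Python sketch that blends the calculator's per-category percentages over a data mix. The percentages are the ones the calculator returned above; the 10TB mix is purely hypothetical, invented for illustration, and the function name is made up too. Since the calculator's savings apply only to inactive files, treat the result as an upper bound.

```python
# Savings rates as returned by the calculator for each category (from the results above).
SAVINGS_RATE = {
    "Active virtual machine files": 0.00,
    "Active databases and messaging files": 0.00,
    "Compressed files": 0.02,
    "Media files": 0.04,
    "Binaries": 0.17,
    "Office documents": 0.33,
    "Text files": 0.53,
}


def blended_savings(mix_tb):
    """Return (TB saved, fraction saved) for a {category: TB} data mix."""
    total_tb = sum(mix_tb.values())
    saved_tb = sum(tb * SAVINGS_RATE[category] for category, tb in mix_tb.items())
    return saved_tb, saved_tb / total_tb


# HYPOTHETICAL 10TB mix, weighted toward application data. Not measured anywhere.
example_mix_tb = {
    "Active virtual machine files": 3.0,
    "Active databases and messaging files": 3.0,
    "Compressed files": 1.0,
    "Media files": 1.5,
    "Binaries": 0.5,
    "Office documents": 0.8,
    "Text files": 0.2,
}

if __name__ == "__main__":
    saved_tb, fraction = blended_savings(example_mix_tb)
    print(f"Saved {saved_tb:.3f}TB of {sum(example_mix_tb.values()):.1f}TB ({fraction:.2%})")
```

With this made-up mix the blend lands at roughly half a terabyte out of 10TB, about 5%. Change the mix and the answer moves: an even 10TB in every category works out to 10.9TB of 70TB, a bit under 16%, because the weighting is the whole story.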
Your voice in the search for dedupe honesty-
DrDedupe

It didn't add up when I looked at their best practices either.
http://blogs.netapp.com/shadeofblue/2009/04/please-squeeze-me-part-3.html
Posted by: Alex McDonald | June 17, 2009 at 02:52 AM
Funny, because my real world customers are coming back with 35-45% savings by simply turning on the Celerra dedupe.
And, if you knew anything about unstructured data, you'd know that the bulk of it is inactive data.
So, we'll concentrate on saving our customers' space while NOT affecting production IO.
Sure, NetApp has a direct advantage doing block level dedupe; but that comes at a cost, doesn't it?
Posted by: NAS guy | June 17, 2009 at 05:13 AM
Being a deduplication expert, your observations need to be forwarded over to the FTC to let them know the market has nothing to fear from EMC's acquisition of DataDomain. It's easy to see how marketing guys get caught up in the hoopla.
Clearly, EMC's technology doesn't work. The patents held are harmless and therefore show no risk to free and fair trade. Hence, the anti-trust claims are without foundation. Agreed?
Posted by: DanM. | June 17, 2009 at 05:32 AM
Duplication, hah. Fools.
Posted by: Leo | June 17, 2009 at 08:09 AM
DanM-
Breathe deeply, I can feel your pulse racing. I am not trying to change the world here, just pointing out some puzzling facts about EMC's dedupe calculator.
DrDedupe
Posted by: DrDedupe | June 17, 2009 at 08:32 AM
NAS guy-
Funny indeed. I look forward to seeing more detail on your customers who are getting 35% savings. And yes, NetApp's block-level dedupe does come at a cost: the cost of learning that SAN and NAS data can be deduped equally well given the right architecture.
DrDedupe
Posted by: DrDedupe | June 17, 2009 at 08:40 AM
Funny indeed how you rounded down to 35% right away. Scared? I have a large hospital system that saved 41% across all their data sets. And they are still using snaps several times a day and management is a breeze; in fact, non-existent. The cost equals LOW.
Meanwhile, when NetApp tries to dedupe active SAN and NAS data, the cost equals HIGH, and the architecture better be a perfect fit and mustn't change in the future. Otherwise you'll end up with very unhappy customers.
Remember, the primary function of your filers is to serve data. Don't forget that, as I believe NetApp marketing has. Instead, they would rather market a few advantageous data points regardless of the cost (in terms of performance and management and massively reduced snapping).
Posted by: NAS guy | June 17, 2009 at 09:04 AM
@NAS guy
If you're using your Celerra farm as a long term WORI archive (write once read infrequently), and much of your data is inactive, >24KB and http://blogs.netapp.com/.a/6a00d8341ca27e53ef011570266c9f970b-pi . That deduplicates and compresses at most to 30% with EMC dedupe. And that's the only use case apart from your rather unusual use profile that I can see for the Celerra.
Posted by: Alex McDonald | June 17, 2009 at 09:29 AM
@NAS guy,
I don't doubt that Celerra dedupe works for its niche. If a customer is saving using this - great! What we're saying is we address a far wider spectrum of use cases and can save more at far lower risk with better performance.
On average we see about 50% storage savings across all workloads. In one of the hottest areas, server/desktop virtualization, we see upwards of 95% storage savings, and I can show that using my disk or someone else's. (For example, we can run EMC's VMware product more efficiently on EMC disk than EMC can. That's scary.) Since dedupe leverages existing ONTAP logic, performance impact is negligible and operationally it's as simple as clicking a button. You can see the historical results in a report or graphically via Operations Manager or your Autosupport Dashboard. If you don't like the results, you can simply turn the process off. You can even un-dupe if you want. So if you can click a button and analyze a bar chart all without impacting the application, I don't see the costs as being that high.
So, NetApp addresses a far wider spectrum of use cases and therefore saves a lot more money for customers. From what I've seen in my customer base, 1 NetApp TB is equal to 2 EMC TB today, and that's with both of our respective dedupe technologies enabled. If you look at how NetApp mitigates the data growth rate going forward - again, wider spectrum - this ratio improves to 1 NetApp TB to every 3 EMC TB. I have to say that this projection does include features such as Thin Provisioning, RAID-DP, and thin clones which, again, are all items which EMC advises against in most situations. (You only need to refer to their documentation.) For NetApp, these are all vital strategies which we consider the ambient condition of storage, not features to be quarantined off from the rest of the data population. The main difference is we see these "file system" features as enablers and EMC views them as a disease.
Posted by: Mike Riley | June 18, 2009 at 08:11 AM
I witnessed a 19-Tile VMmark (114 VMs) test on a NetApp system. The first run was without dedupe; the second was with dedupe enabled. While there were 72% space savings, the performance was exactly the same. Not sure which production IOs should be affected here...?
Posted by: Chrigi | June 18, 2009 at 04:00 PM