I am fascinated by mathematics in a “wow, that’s a cool result, how’d they work that out?” sort of way. One of my favourite blogs is Good Math, Bad Math, to which I’m drawn like a moth. The guy is brilliant, but sometimes I’m just more than a little sorry for the dismembered victims of his razor sharp mind. Great spectator sport, but it could be me…
So when I spotted EMC blogger Scott Waterhouse using statistics from an IBM paper on RAID-5 vs RAID-6 reliability, my first thought was: hey, that’s wrong! But I’m no whizz with the numbers, so I thought I’d employ the biggest brain we have in NetApp on the subject: Jon Elerath, one of the authors of the original NetApp paper and one of the sharpest knives in the drawer.
Over to Jon:
Keep in mind that the model’s 6% number depends on the scrub rate (how long it takes to correct data corrupted from media issues). I assumed 168 hours (1 week). If we or EMC use half that interval (say, 84 hours), the probability of a RAID group failing in 5 years drops by about half as well, to 3%.
The original data was from 2006, but currently we are seeing much lower failure rates for operational failures. It’s a figure of about half of what I saw back then.
These two reductions give a number in the order of a quarter of 6%, or 1.5%.
Also remember that as data is read, data corrupted by media defects can be corrected by the reader (say, the OS), thereby eliminating the latent defect. So, on an absolute basis, the real number of failures depends on inherent HDD operational failure rates, the scrub rate and how smart the OS is in dealing with latent defects, something that varies amongst competitive systems.
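Jon's arithmetic can be checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes the failure probability scales linearly with both the scrub interval and the operational failure rate, which is how the halving is described above but is not the model itself. The function and constant names are made up for illustration.

```python
# Back-of-envelope check of the scaling described above.
# ASSUMPTION: linear scaling; the underlying reliability model is not reproduced.
BASELINE_P_FAIL = 0.06      # model's 5-year RAID group failure probability
BASELINE_SCRUB_HOURS = 168  # scrub interval assumed in the model (1 week)


def scaled_failure_probability(scrub_hours, op_failure_ratio=1.0):
    """Scale the baseline by scrub interval and operational failure rate."""
    scrub_ratio = scrub_hours / BASELINE_SCRUB_HOURS
    return BASELINE_P_FAIL * scrub_ratio * op_failure_ratio


weekly = scaled_failure_probability(168)                      # original assumption
half_scrub = scaled_failure_probability(84)                   # scrub twice as often
both = scaled_failure_probability(84, op_failure_ratio=0.5)   # plus today's drives

for label, p in [("168h scrub", weekly), ("84h scrub", half_scrub),
                 ("84h scrub, half op failures", both)]:
    print(f"{label}: {p:.1%}")
```

The three cases correspond to the 6%, 3% and 1.5% figures quoted above.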
And as I pointed out before when this all started, RAID-VTL isn’t like other RAID-4 or RAID-5. The RAID group gets shut down until the rebuild -- just on written data -- is done, reducing the risk even further.
Jon again:
But that wasn’t really the point of the model I developed. The real importance was to compare the relative values of single parity RAID to dual parity. Using the same distributions for both models shows that RAID-DP has a lower probability of failure, by about 4000-5000 times.
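To put the relative figure in perspective, here is a rough sketch in Python. It assumes, purely for illustration, that the 6% number is the single parity case and that the 4000-5000 times factor can be applied to it directly; the names are hypothetical and nothing here comes from the model itself.

```python
# Illustrative only: apply the quoted improvement factor to the 6% figure.
# ASSUMPTION: 6% is the single parity 5-year probability and the factor
# applies to it directly.
SINGLE_PARITY_P_FAIL = 0.06
IMPROVEMENT_FACTORS = (4000, 5000)  # "about 4000-5000 times" lower


def dual_parity_estimates(single_p, factors):
    """Divide the single parity probability by each improvement factor."""
    return [single_p / factor for factor in factors]


estimates = dual_parity_estimates(SINGLE_PARITY_P_FAIL, IMPROVEMENT_FACTORS)
for factor, p in zip(IMPROVEMENT_FACTORS, estimates):
    print(f"{factor}x better: {p:.4%} over 5 years")
```

Under those assumptions the dual parity figure lands somewhere around one or two thousandths of a percent.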
If Scott is even half way to the truth with his claim that NetApp’s VTL (now with deduplication) is 100% likely to fail in a five year period (although he’s corrected the more obvious errors now), then here’s the kicker.
Every EMC CLARiiON or Symmetrix system implemented with RAID-5 is more at risk.
Serious stuff, Scott. I can only suggest getting StorageZilla to bless RAID-6 for Tier 1 Oracle applications if I were you. Perhaps not. Well, at least RAID-10, and then we can discuss the NetApp Space Guarantee and really make a meal of it.

Hey Alex;
Thanks for the civilized rebuttal. Interesting stuff here. As I have said several times in both the posts and the comments, one of the core dilemmas NetApp has is this continued espousal of RAID-DP as one of the "only" safe RAID levels (in addition to RAID-10). Fine, but that leaves you with an inherent contradiction, because the VTL doesn't use RAID-DP.
And despite the evidence you provided claiming that VTL RAID is not RAID-5, I think we have adequately disposed of that (shutting down writes does nothing more than leave your rebuild time about equal to a Clariion with equivalent size drives).
So it is an issue, and here is the kicker: to do it on a VTL with deduplication is very much more dangerous than a standard array, because the failure of any one RAID group will result in the failure of the entire array. And for that reason, there is no equivalency whatsoever to a Symm or Clariion with RAID-5. (Of course, I would also further emphasize that the failure of a VTL array can expose PB of data, while the failure of a "production" RAID group will only expose a few TB, and there is presumably a backup for it in any event.) Unless you can show evidence that causing a RAID group failure won't result in total data loss when dedup is enabled?
But anyway, thanks for the dialogue, and thanks for exposing a key assumption (the 168 hours for recovery) that I either missed or is not in the original paper.
Posted by: Scott Waterhouse | October 31, 2008 at 12:51 PM
Alex:
I think NetApp ought to be consistent and just pre-announce any capability or feature they might be missing.
Don't worry about not having RAID-DP for your VTL, just pre-announce it'll be available "sometime next year" and worry about it later, just like you've done for enterprise flash, FCoE, 8Gb FC etc.
It's so much easier that way, isn't it?
-- Chuck
Posted by: Chuck Hollis | November 05, 2008 at 05:12 AM
Chuck, patience.
Perhaps you could do me a favour in return and show a little EMC consistency. I know you operate to a higher level of courtesy in your blogging, so getting StorageZilla to desist from likening me to a clown on crystal meth would be nice.
Thanks.
Posted by: Alex McDonald | November 05, 2008 at 10:00 AM
I understand where Scott is trying to take his point. I may not agree with it but I do appreciate his point of view.
Chuck's point of view is unique and entertaining and I appreciate the entertainment value. When you read an entry on Chuck's blog downplaying RAID-6 as some sort of "marketing feature" and then watch as EMC announces RAID-6 support on the DMX a few weeks later (and the Clariion a few weeks after that), you have to chuckle (no pun intended but I now have a better appreciation for the word).
Chuck didn't get the whole unified storage approach...until EMC claimed to discover its value in August of this year with the NX4. (Kind of like saying someone from Hopkinton "discovered" Manhattan this year). Of late, Chuck has been downplaying deduplication on primary storage. NetApp customers have reported how invaluable they find this feature, particularly in their VMware environments.
Now, I can put aside EMC's attempts to market backup technology like Avamar as primary dedupe. They're just hoping to muddy the waters long enough to allow sales folks to talk until they think of something to say. But, if the Chuck-ometer is true to form, I think it's safe to say that if Chuck is downplaying a technology, EMC is about to announce said technology. I'm sure EMC will announce primary dedupe in some form pretty soon. It will come complete with the usual small print on the performance trade-offs and preferred environments and large print on how EMC has "discovered" dedupe on primary storage.
I'm sure Chuck would like to pre-announce primary dedupe but here's the rub for Chuck. If you take his blog at face value, then you'd have to say he's simply not aware of the EMC roadmap. That alone conjures up a story fit for an episode of The Office. But, come on, you have to believe Chuck is aware of the EMC roadmap, don't you? (You do, right?) Anyway, I take Chuck's blog for its non-technical entertainment value. He's aware of the roadmap but to decode it, you just have to read Chuck's latest gripe-log on NetApp. Not too long ago Chuck dismissed the idea of dedupe on primary and recommended dedupe of the backup as the most effective place for this technology, particularly for VMware environments. That tells me primary dedupe for EMC can't be that far off and I would expect them to target VMware environments. So, in Chuck's own way he already pre-announces much of the EMC roadmap. You just have to pay attention to what really bothers him about what NetApp is shipping today. Kind of fun having a Chuck-ometer in the office. I have it right next to my Michael Scott and Dwight Schrute bobblehead dolls.
Posted by: Mike Riley | November 10, 2008 at 12:10 AM