I am fascinated by mathematics in a “wow, that’s a cool result, how’d they work that out?” sort of way. One of my favourite blogs is Good Math, Bad Math, to which I’m drawn like a moth. The guy is brilliant, but sometimes I feel more than a little sorry for the dismembered victims of his razor-sharp mind. Great spectator sport, but it could be me…
So when I spotted EMC blogger Scott Waterhouse using statistics from an IBM paper on RAID-5 vs RAID-6 reliability, my first thought was: hey, that’s wrong! But I’m no whizz with the numbers, so I thought I’d employ the biggest brain we have in NetApp on the subject: Jon Elerath, one of the authors of the original NetApp paper and one of the sharpest knives in the drawer.
Over to Jon:
Keep in mind that the model’s 6% number depends on the scrub rate (how long it takes to correct data corrupted by media defects). I assumed 168 hours (1 week). If we or EMC use half that (say, 84 hours), the probability of a RAID group failing in 5 years drops to about half, or 3%.
The original data was from 2006, but currently we are seeing much lower failure rates for operational failures. It’s a figure of about half of what I saw back then.
These two reductions give a number in the order of a quarter of 6%, or 1.5%.
Also remember that as data is read, data corrupted by media defects can be corrected by the reader (say, the OS), thereby eliminating the latent defect. So, on an absolute basis, the real number of failures depends on inherent HDD operational failure rates, the scrub rate and how smart the OS is in dealing with latent defects, something that varies amongst competitive systems.
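To make Jon's arithmetic concrete, here is a back-of-the-envelope sketch in Python. It assumes the failure probability scales linearly with both the scrub interval and the operational failure rate, which is only a rough reading of his "about half" figures and not his actual model. The function name and parameters are my own invention for illustration.

```python
def scaled_failure_probability(baseline, scrub_hours,
                               baseline_scrub_hours=168.0,
                               operational_rate_factor=1.0):
    """Rough linear rescaling of a 5-year RAID group failure probability.

    Assumption (not from the model itself): the probability is
    proportional to the scrub interval and to the operational
    failure rate, which only approximates the real distributions.
    """
    scrub_factor = scrub_hours / baseline_scrub_hours
    return baseline * scrub_factor * operational_rate_factor


# Baseline from the model: 6% with a 168-hour (1 week) scrub interval.
baseline = 0.06

# Halving the scrub interval to 84 hours gives about 3%.
half_scrub = scaled_failure_probability(baseline, scrub_hours=84)

# Halving the operational failure rate as well gives about 1.5%.
both = scaled_failure_probability(baseline, scrub_hours=84,
                                  operational_rate_factor=0.5)

print(f"{half_scrub:.1%}, {both:.1%}")
```

The third factor Jon mentions, how well the OS repairs latent defects as data is read, is not modelled here at all, so treat the result as an upper-level sanity check and nothing more.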
And as I pointed out before when this all started, RAID-VTL isn’t like other RAID-4 or RAID-5. The RAID group gets shut down until the rebuild -- just on written data -- is done, reducing the risk even further.
Jon again:
But that wasn’t really the point of the model I developed. The real importance was to compare the relative values of single parity RAID to dual parity. Using the same distributions for both models shows that RAID-DP has a lower probability of failure, by about 4000-5000 times.
If Scott is even half way to the truth with his claim that NetApp’s VTL (now with deduplication) is 100% likely to fail in a five year period (although he’s corrected the more obvious errors now), then here’s the kicker.
Every EMC CLARiiON or Symmetrix system implemented with RAID-5 is more at risk.
Serious stuff, Scott. If I were you, I’d get StorageZilla to bless RAID-6 for Tier 1 Oracle applications. Perhaps not. Well, at least RAID-10, and then we can discuss the NetApp Space Guarantee and really make a meal of it.


