March 21, 2006

Expect Double Disk Failures With ATA Drives


After my last blog entry, comparing our describe-the-whole-company marketing message with EMC's, I feel like writing about something more technical. Why not double disk failures in RAID arrays?

Normal RAID arrays (RAID 4 or RAID 5) protect against any single disk failure. The natural question is: How likely is a double disk failure?

There are two answers. The first is that it's pretty unlikely to get two complete disk failures at the same time. The second is that it is frighteningly likely to have at least one read fail during the reconstruct. If this happens, you lose two blocks of data: the block for which the read failed, and also the corresponding block on the failed drive that you were hoping to reconstruct. If you are lucky, the two blocks hold unimportant data, maybe even unallocated space, but you can't be sure.

My math says that with a four disk RAID array, based on 400GB ATA drives, you will lose data in about 10% of RAID reconstructs. Yow! That should make it clear why NetApp has double parity RAID (RAID-DP) enabled by default on all of our systems. (Some people use the term RAID-6 for any double parity RAID.)

Let's walk through the math for this four-disk example. Seagate's technical specs for a 400GB SATA drive say that the bit error rate is "1 per 10^14". If one drive fails, you have to read all the data on the other three drives to do the reconstruct. The expected failure rate is the total number of bits read multiplied by the bit error rate (that is, divided by 10^14 bits per error):

400,000,000,000 * 3 * 8 / 10^14 = 9.6%


Build a 16-disk RAID group, so that a reconstruct has to read 15 surviving drives instead of 3, and the expected failure rate goes to 48%. (Again, this is not total data loss, but loss of two blocks of data during the RAID reconstruct.)
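For anyone who wants to plug in their own drive sizes, the arithmetic above can be sketched in a few lines of Python. The function and constant names here are my own, chosen for illustration, and the sketch assumes the reconstruct reads every bit on every surviving drive, as in the example.

```python
def expected_read_errors(drive_bytes, disks_in_group, bits_per_error):
    """Expected number of failed reads during one RAID reconstruct.

    Assumes one drive has failed and every bit on each surviving
    drive must be read once.
    """
    surviving_drives = disks_in_group - 1
    bits_read = drive_bytes * 8 * surviving_drives
    return bits_read / bits_per_error


DRIVE_BYTES = 400 * 10**9   # 400GB drive, decimal gigabytes as in the post
BITS_PER_ERROR = 10**14     # "1 per 10^14" from the drive spec

for disks in (4, 16):
    rate = expected_read_errors(DRIVE_BYTES, disks, BITS_PER_ERROR)
    print(f"{disks}-disk group: {rate:.1%}")
```

With the numbers from the post this gives 9.6% for the four-disk group and 48% for the 16-disk group.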

I believe customers should be asking hard questions of any storage vendor selling ATA drives without some form of double-error protecting RAID. To be fair, some people claim that the bit error rate of ATA is down to 1 per 10^15, in which case you'd expect to lose data in about 1% of RAID reconstructs with our hypothetical four-disk array. On the other hand, many people configure arrays with more than 4 drives per RAID group, which drives the failure rate back up. For a 16-disk array, the expected failure rate would be 4.8%. Even 1% seems bad to me.
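One refinement, offered as a rough sketch rather than a vendor formula: bits read divided by bits per error is strictly an expected count of failed reads, which slightly overstates the chance of hitting at least one failure when the count gets large. If you assume each bit fails independently with the spec-sheet probability (my assumption, not something the drive spec states), the probability of at least one failed read is 1 - (1 - p)^n. The helper name below is hypothetical.

```python
import math


def p_at_least_one_read_error(drive_bytes, disks_in_group, bits_per_error):
    """Probability that a reconstruct hits at least one failed read.

    Assumes independent bit errors, each with probability
    1 / bits_per_error (an illustrative assumption).
    """
    bits_read = drive_bytes * 8 * (disks_in_group - 1)
    p_bit = 1.0 / bits_per_error
    # 1 - (1 - p_bit)**bits_read, written with log1p/expm1 so the
    # tiny per-bit probability is not lost to floating-point rounding.
    return -math.expm1(bits_read * math.log1p(-p_bit))


DRIVE_BYTES = 400 * 10**9  # 400GB drive

for bits_per_error in (10**14, 10**15):
    for disks in (4, 16):
        p = p_at_least_one_read_error(DRIVE_BYTES, disks, bits_per_error)
        print(f"1 per {bits_per_error:.0e}, {disks}-disk group: {p:.1%}")
```

For small rates the two calculations agree closely (a little over 9% versus 9.6% for the four-disk case at 1 per 10^14). For the 16-disk group at 1 per 10^14, the probability comes out nearer 38% than 48%, which does not change the conclusion.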

Bottom line, using ATA drives without double protecting RAID is questionable. As drives grow, I suspect that it will become a requirement even for Fibre Channel drives.





© NetApp, Inc.  |  "Safe Harbor" Statement