Earlier this week I wrote about how the core architecture of Traditional Legacy Arrays limits the performance of RAID-6. Just to recap, RAID-6 requires an additional read and an additional write over and above RAID-5, limiting its performance and applicability.
What I didn’t do is explore the consequences of such limitations.
The first and most obvious consequence is the proliferation of RAID-10. SATA disks are too big to be used with RAID-5. The only alternative is to use RAID-10 when you need both resiliency and performance!
This singular fact turns the economics of RAID on its head.
Let’s be clear, the reason RAID became so popular is because it solved two generic problems: performance and availability at a lower cost than the alternatives. The problem with disk drives is that they have a limited number of IOPS per spindle. The only way to get more IOPS is to combine more of them and spread the IOPS over the spindles. The downside is that the MTBF (mean-time-between-failures) of the disk drives makes the group of disk drives increasingly less reliable.
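The trade-off above can be put into rough numbers. The following is a minimal sketch, not a vendor model: it assumes load spreads perfectly over the spindles and that drives fail independently at a constant rate, and the per-drive IOPS and MTBF figures are hypothetical values chosen only for illustration.

```python
def aggregate_iops(iops_per_drive: float, drives: int) -> float:
    """Ideal aggregate IOPS when load spreads evenly over all spindles."""
    return iops_per_drive * drives


def group_mtbf_hours(drive_mtbf_hours: float, drives: int) -> float:
    """Mean time to the first failure anywhere in an unprotected group.

    Assumes independent drives with a constant failure rate, so the
    failure rates add and the group MTBF is the drive MTBF divided by N.
    """
    return drive_mtbf_hours / drives


# Hypothetical figures, for illustration only.
IOPS_PER_DRIVE = 150
DRIVE_MTBF_HOURS = 1_000_000

for n in (1, 4, 8, 16):
    print(n, aggregate_iops(IOPS_PER_DRIVE, n), group_mtbf_hours(DRIVE_MTBF_HOURS, n))
```

Under these assumptions, every drive you add buys IOPS linearly and costs reliability at the same rate, which is exactly the tension described above.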
So if you’re a customer with a business-critical application that requires availability and performance, you’re more or less screwed. To get the availability and the performance, you need 2-3x the storage. As an example, consider GFS, which uses mirroring and no RAID. GFS will have at least 3 full copies of the data.
Enter RAID.
RAID is so commonplace nowadays that we forget how miraculous it must have appeared to folks when it first showed up. RAID allowed the computer industry to provide resiliency and performance at a lower cost! So for example, using RAID-5 in a 4+1 layout, only 1/5 of the raw capacity goes to parity (25% on top of the data), and you get terrific data resiliency compared to the 3x that GFS requires.
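To make the capacity comparison concrete, here is a small sketch that computes how much raw disk each scheme needs per unit of usable data. The function names and the set of schemes are my own choices for illustration; the 4+1 layout and the three full copies come from the text, and RAID-10 is modelled as two full copies.

```python
from fractions import Fraction


def raw_per_usable_parity(data_drives: int, parity_drives: int) -> Fraction:
    """Raw capacity per unit of usable capacity for a parity RAID group."""
    return Fraction(data_drives + parity_drives, data_drives)


def raw_per_usable_copies(copies: int) -> Fraction:
    """Raw capacity per unit of usable capacity when keeping full copies."""
    return Fraction(copies)


schemes = {
    "RAID-5 (4+1)": raw_per_usable_parity(4, 1),
    "RAID-10 (2 copies)": raw_per_usable_copies(2),
    "3 full copies": raw_per_usable_copies(3),
}

for name, ratio in schemes.items():
    overhead_pct = float(ratio - 1) * 100
    print(f"{name}: {float(ratio):.2f}x raw, {overhead_pct:.0f}% overhead")
```

The gap between 1.25x and 2x or 3x is the whole RAID value proposition in one line of arithmetic.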
Enter Big Drives
The problem with RAID-5 and RAID-4 is that the probability of various kinds of failures increases as the disk drive capacity increases. No surprise.
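One of those failure modes can be sketched numerically: hitting an unrecoverable read error while rebuilding a single-parity group, when every surviving drive has to be read end to end. This is a simplified model under stated assumptions, not a statement about any particular product. It assumes independent bit errors, and the default rate of one error per 1e14 bits read is an assumed, spec-sheet-style figure, not one taken from this post.

```python
import math

# Assumed unrecoverable read error rate (errors per bit read).
ASSUMED_URE_PER_BIT = 1e-14


def rebuild_read_error_probability(
    surviving_drives: int,
    drive_tb: float,
    ure_per_bit: float = ASSUMED_URE_PER_BIT,
) -> float:
    """Probability of at least one unrecoverable read error during a rebuild.

    The rebuild must read every bit on every surviving drive, so
    P = 1 - (1 - ure_per_bit) ** bits_read, computed with log1p/expm1
    to stay accurate for very small error rates.
    """
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Same 4+1 group, with one drive lost, at growing drive sizes.
for tb in (0.25, 0.5, 1.0, 2.0):
    p = rebuild_read_error_probability(surviving_drives=4, drive_tb=tb)
    print(f"{tb} TB drives: {p:.1%}")
```

With the group size held fixed, the only thing that changes is how many bits the rebuild has to read, so the exposure grows with drive capacity.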
NetApp pioneered the use of RAID-6 with our FAS systems, and because of WAFL we are able to preserve the RAID value proposition of increasing capacity and performance with a modest increase in physical capacity.
But we’re not the market.
As much as I would love to be the top-dog in the storage industry, the Traditional Legacy Array vendors, and in particular EMC, don’t have a very good RAID-6 implementation. Or rather, as I said earlier, their core architecture precludes a RAID-6 implementation that they can recommend for all workloads.
And because of their share of the market, they define the technology expectations for everyone who intends to use storage.
The problem is that we don’t live in a static universe. Software companies, like Microsoft, are continuously trying to reduce the total cost of their solutions. And although it’s nice to be able to bill customers for RAID-10, the reality is that for every dollar that goes into the hardware, Microsoft makes less money.
So Microsoft and the rest of the software industry, faced with the reality that the dominant Traditional Legacy Array vendors were pushing RAID-10, or basically 2x the physical capacity for each byte of data stored, looked elsewhere for a solution.
It must have been particularly galling to software vendors that disk drive capacities were increasing and disk drive cost per GB was decreasing, but the storage infrastructure was increasing in cost, reducing the total available pie.
Enter host based replication etc…
The cool thing about technology trends and curves is that everyone is staring at them.
If I were a software application architect, here’s what I would observe. The biggest cost for an application from a hardware perspective is the storage, and the biggest cost factor is the disk drives.
Today with RAID-5, the picture looks like this:
The red controllers are provided by Traditional Legacy Array vendors, and there is a small increase in the total capacity to get excellent resiliency (the pink area).
But with bigger SATA drives, and the absence of a high performance RAID-6, the world looks like this:
Ouch. The total cost of the solution just jumped up! You get cheaper disk drives, but you need more of them!
Now if I am an application architect, I think to myself: you know, that RAID stuff can be done in software, I can do the mirroring at the application level, and the server vendors sell really cheap disk drives, so maybe the picture can really be made to look like this:
Each server has a pool of disk drives, and mirroring is done in the host to obtain some degree of storage resiliency. Yes, the costs are greater than RAID-5, but it’s not clear they are that much higher than RAID-10 … And you might even believe they were lower than RAID-10 because you’re saving on the dollars you had to give to the storage company.
So what does this have to do with NetApp?
I really believe that RAID-10 mirroring or DAS is an inefficient way to manage disk drives.
I also believe that the popularity of DAS/Host mirroring solutions is a response to the failure of the market leaders to build an array that can preserve the RAID value proposition.
Since RAID-10 and DAS/Host mirroring are a response but the wrong fundamental approach, I believe that architectures, like NetApp’s, that make RAID-6 work will ultimately win out.
And oftentimes in the debate about DAS vs shared storage, we confuse the specific problems of the Traditional Legacy Array vendors with the right long term technology trend.