In part 2 of the DAS disruption series I described the Unix server market collapse. In this companion post I want to drill into that collapse because of the crucial role storage played.
In the 1980's UNIX workstation vendors began to chase ever-increasing performance with increasingly sophisticated system designs. At the time, two fateful decisions were made. The first was to abandon CISC commodity processor architectures in favor of their own RISC designs, and the second was to put more processors into each box.
Of the two decisions, it was perhaps the decision to build increasingly large SMP systems that was the more fateful.
To build a large multi-processor system, it is necessary to build a memory bus that has sufficient bandwidth. In addition to memory bandwidth, applications expected symmetric memory behavior, and the electronics necessary to make that work just added to the cost.
In many ways, instead of relying on external commodity network infrastructure to hook up processors, the system vendors built their own custom networks inside their own sheet metal. In the 1980's and 1990's they had no choice. The commodity network infrastructure was simply too slow.
But the consequence of both decisions was to increase the total system cost.
As system performance increased, another trend emerged in the UNIX market: increasing system reliability.
Although nowadays UNIX has a well-deserved reputation for six-nines reliability, in the 1980's its reliability was something to mock, and The UNIX-HATERS Handbook has testimonials that do exactly that. The authors remarked that it was a good thing that the reboot loop was fast because, after all, the system rebooted a lot.
What does this mean?
As systems increased in performance and reliability, an increasingly large share of customers' compute requirements was being serviced by a smaller set of big systems. Ironically, the vendors that began life competing with the big monolithic mainframe had built a mainframe.
As a result, making these big systems robust became increasingly important. Every minute of downtime could cost millions of dollars to a company that depended on a few of these machines.
Enter RAID
As the UNIX vendors began to build increasingly robust systems, it became apparent that disk drives were going to be a performance and reliability bottleneck.
Although disk drives are remarkable feats of mechanical and materials engineering, they suffer from two distinct technology challenges. The first is that they are mechanical devices, and therefore suffer the wear and tear of mechanical devices. The second is that, because they are mechanical devices, their performance does not track Moore's law.
For a UNIX server vendor, this created two problems. If you needed performance, you had to combine many drives and spread your workload across them, but the more drives you spread your workload across, the greater the likelihood that a single drive failure would take down your system. If you needed capacity, you had the same problem: adding disk drives gave you more capacity but reduced reliability.
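To put a rough number on that tradeoff, here is a minimal Python sketch. It assumes drives fail independently and that losing any one drive takes the system down. The 3% annual failure probability is a made-up figure for illustration, not a measurement.

```python
def p_any_drive_fails(n_drives: int, p_single: float) -> float:
    """Probability that at least one of n independent drives fails.

    Assumes failures are independent and every drive has the same
    failure probability p_single over the period of interest.
    """
    return 1.0 - (1.0 - p_single) ** n_drives


# Hypothetical figure: a 3% chance that any one drive fails in a year.
P_SINGLE = 0.03

for n in (1, 4, 16, 64):
    risk = p_any_drive_fails(n, P_SINGLE)
    print(f"{n:3d} drives: {risk:.1%} chance of at least one failure per year")
```

The exact numbers depend entirely on the assumed failure rate, but the shape is the point: without redundancy, the risk of losing the system climbs quickly as drives are added.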
An obvious solution to the reliability problem was to mirror the disk drives. The problem with mirroring, if you weren't a disk drive manufacturer, was the cost.
Thankfully, David Patterson, Garth Gibson, and Randy Katz squared the circle with RAID and made it possible to have reliability, performance, and capacity at a fraction of the cost of mirroring.
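The cost argument is easy to see with a little arithmetic. The sketch below compares mirroring against a single-parity layout, where a group of data drives shares one extra drive's worth of parity. The function names and the group size of 4 are my own assumptions for illustration, and the sketch only counts drives; it says nothing about performance.

```python
def drives_needed_mirrored(data_drives: int) -> int:
    """Mirroring keeps a full copy of every data drive."""
    return 2 * data_drives


def drives_needed_single_parity(data_drives: int, group_size: int) -> int:
    """Each group of up to group_size data drives gets one parity drive."""
    groups = -(-data_drives // group_size)  # ceiling division
    return data_drives + groups


def overhead(total_drives: int, data_drives: int) -> float:
    """Extra drives bought for redundancy, as a fraction of the data drives."""
    return (total_drives - data_drives) / data_drives


DATA_DRIVES = 8   # hypothetical amount of usable capacity, in drives
GROUP_SIZE = 4    # hypothetical number of data drives per parity drive

mirrored = drives_needed_mirrored(DATA_DRIVES)
parity = drives_needed_single_parity(DATA_DRIVES, GROUP_SIZE)
print(f"mirroring:     {mirrored} drives, {overhead(mirrored, DATA_DRIVES):.0%} overhead")
print(f"single parity: {parity} drives, {overhead(parity, DATA_DRIVES):.0%} overhead")
```

Under these assumptions mirroring doubles the drive count, while parity adds one drive per group, and each group can still survive the loss of any one of its drives.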
What RAID enabled, along with the decline in the cost of compute infrastructure and the increase in network speeds, was the emergence of shared storage.
Let's be precise: adding a shared storage device increases cost, because you have to add networking infrastructure and compute infrastructure. That cost can only be justified if that infrastructure is adding tremendous value or taking out cost. RAID made it possible for the storage vendors to do both: add value and take out cost.
NAS and SAN vs DAS in the 1990's
In the 1990's NAS and SAN disrupted DAS. The reasons are many, but they basically boil down to the fact that EMC SAN and NetApp NAS performed better than the alternatives, either DAS or UNIX file servers.
As the EMC and NetApp devices proliferated, it became clear that they also enabled unique functionality that could solve real business problems.
But we were talking about servers...
If you remember the mental picture I drew earlier, the UNIX system vendors were building these increasingly big systems and making those systems increasingly robust.
What data center architects and computer visionaries realized was that the UNIX system, like the mainframe, had absorbed lots of commodity elements into a boutique design. As the capabilities of the commodity elements improved, the value of the boutique elements dropped.
And in the 1990's the UNIX server market got hit by a perfect storm. Storage was outside of the UNIX system. The commodity processors were getting fast enough. The network infrastructure had gotten fast enough.
Of all of those things, CPU, network, and storage, it was perhaps external storage that killed the UNIX server market.
External storage allows you to store your entire state outside of a server. So, if the state is outside of the server, then what's the point of having a highly available server? What are you protecting? Well, an obvious answer is that you don't want your applications to keep stopping and starting. But what if you could solve that problem differently? Suppose that, instead of having one very reliable computer system, I could have two cheaper computer systems that were less reliable individually but more reliable as a pair? That could work if and only if I didn't have to replicate the state, if only I could share the state ... and oh yes, external storage makes it possible for me to share the state.
And that's what happened.
Application reliability was solved in two ways. The availability of the state was addressed by the storage. The availability of the application was addressed through clustering. Effectively, applications were designed to be able to restart quickly from state that was stored externally to the server.
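The "less reliable individually but more reliable as a pair" claim can be checked with a back-of-the-envelope calculation. This Python sketch assumes the two servers fail independently, that failover is instantaneous, and that the shared external storage is always available. The availability figures are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def pair_availability(a_single: float) -> float:
    """Availability of a two-node cluster that is up if either node is up.

    Assumes independent node failures, instant failover, and shared
    state that is always reachable.
    """
    return 1.0 - (1.0 - a_single) ** 2


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


BIG_IRON = 0.9999  # hypothetical single highly available server
CHEAP = 0.999      # hypothetical commodity server

clustered = pair_availability(CHEAP)
print(f"one big server:    {downtime_minutes_per_year(BIG_IRON):8.2f} min/year down")
print(f"one cheap server:  {downtime_minutes_per_year(CHEAP):8.2f} min/year down")
print(f"two cheap servers: {downtime_minutes_per_year(clustered):8.2f} min/year down")
```

Real clusters fall short of this idealized number, because failover takes time and failures are not perfectly independent, but the calculation shows why the pair can beat the single expensive box once the state lives somewhere else.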
So in summary
External storage outperformed direct attached storage. External storage made it possible to protect application state outside of a server, which meant that the availability of the server was less important.
Improved commodity processor performance made the value of the RISC processors less compelling.
Improved network bandwidth made it possible to build very large compute infrastructures without requiring specialized memory buses.
All three of these elements made it possible to replace big iron with lots of smaller, less reliable pieces. Except that's not strictly true. The servers were replaced with less reliable pieces, but the storage became more reliable.
The net effect, in the data center, was that server costs were being traded off against software costs, networking costs and storage costs, and given the dollar amounts involved, everyone but the boutique server vendors came out ahead.