This is a more technical/philosophical post than my usual commentary; forewarned is forearmed.
Earlier I wrote about the notion that software is like group theory. What I said was that there exist abstractions that form in a software system over time, and that the key to understanding and improving a software system is to understand those abstractions.
I also argued that those abstractions were not accidental, but were actual manifestations in software of how the world really works. By making the case that these were real abstractions and not artificial constructs, I argued that you change or fight them at your peril.
What's important, of course, is that any mathematical, and therefore by extension, software abstraction is an approximation of the world, not the real thing. And that approximation may help you solve some problems better than others ...
When we argue about software and hardware systems in the corporate storage blog-o-sphere, we are arguing about specific notions of how the world is actually constructed. I believe part of the acrimony and bitterness is that we're not just puffing our chests and cheering our team on; we actually have different world views.
Enough with the blah, blah, architecture, blah, group theory, blah
My recent posts about WAFL reminded me of a theory Prof. David Cheriton repeated in his Stanford course on distributed systems. What Prof. Cheriton observed was that innovation often happens when, by exploiting a computer trend, you relax a previously hard constraint, making a previously hard thing easy.
The example he used, if my increasingly faulty memory serves me right, was the use of HTML/XML and ASCII to serialize or marshall and then deserialize or unmarshall parameters.
A canonical problem in distributed system design is how to transport data over the wire. The problem is that data is represented as a stream of binary digits in main memory. The only thing that gives it context is the structure software imposes on the data.
The problem in a distributed system is that the software on each end can be written in a different programming language and run on different hardware, so there have to be well-understood, language- and hardware-independent rules about what the stream of bits means.
For example, suppose I have a remote procedure call int foo(int b). I need to convert the input parameter b into some format that can be transported over the wire and interpreted on the other side as an integer with the same value. This transformation is known as marshalling and unmarshalling; I'll call it translation.
The traditional approach to this problem was to use a binary protocol that encoded the integer into a system independent format as you went over the wire and then translated the binary data into the system dependent format at the point where it needed to be interpreted. Software on either side of the wire using the translator could send and receive data.
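To make the translation step concrete, here is a minimal sketch in Python of what a binary translator does for int foo(int b). It is only an illustration, not any particular RPC system's wire format: the choice of a 4-byte big-endian ("network order") encoding, the helper names marshal_int, unmarshal_int and call_foo_remotely, and the body of foo are all my assumptions.

```python
import struct

# Assumed wire format: "!i" = network (big-endian) byte order,
# 4-byte signed integer, regardless of the host's native layout.
WIRE_FORMAT = "!i"

def marshal_int(value):
    """Translate a native int into the system-independent 4-byte form."""
    return struct.pack(WIRE_FORMAT, value)

def unmarshal_int(data):
    """Translate 4 bytes from the wire back into a native int."""
    (value,) = struct.unpack(WIRE_FORMAT, data)
    return value

def foo(b):
    """Hypothetical remote procedure; the body is just a placeholder."""
    return b + 1

def call_foo_remotely(b):
    """Simulate one round trip without a real network in between."""
    request = marshal_int(b)               # caller side: encode argument
    result = foo(unmarshal_int(request))   # callee side: decode and run
    reply = marshal_int(result)            # callee side: encode result
    return unmarshal_int(reply)            # caller side: decode result
```

Both sides only work if they agree on WIRE_FORMAT ahead of time, which is exactly the shared-translator requirement described above.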
Building such an infrastructure that is robust is a relatively hard but straightforward problem.
Building such an infrastructure that is widely used by everyone was impossible.
The problem was that there was no industry standard for the translators. For computer programs to talk to each other over the wire they needed proprietary and expensive binary translators. CORBA, for example, was a failed attempt at creating such a standard.
Enter HTML/XML
HTML/XML are horribly inefficient compared to a binary translator, both in their use of bandwidth and in their use of compute resources. For example, a four-byte integer quantity can turn into an 11-byte string, plus the meta-data required to tag the value.
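The arithmetic behind the 11 bytes can be checked with a few lines of Python. This is a rough sketch: the `<int>` tag is a made-up example of tagging meta-data, not taken from any real schema, and the function names are mine.

```python
import struct

def binary_size(value):
    """Bytes needed for a 4-byte signed integer in binary form."""
    return len(struct.pack("!i", value))

def text_size(value):
    """Bytes needed for the same integer written as ASCII digits."""
    return len(str(value).encode("ascii"))

def tagged_size(value, tag="int"):
    """Bytes needed once the text is wrapped in a (hypothetical) XML tag."""
    return len(f"<{tag}>{value}</{tag}>".encode("ascii"))

# The most negative 32-bit value is the longest to write out:
# a minus sign plus ten digits.
worst_case = -2**31
sizes = (binary_size(worst_case), text_size(worst_case), tagged_size(worst_case))
```

For the worst case, sizes comes out as 4 bytes in binary, 11 bytes as text, and 22 bytes once the assumed tag is added.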
But HTML/XML do not require a proprietary binary translator on either side of the wire.
So the idea was to relax your requirements for performance to get ease of use. And that was okay because CPUs were becoming faster, memory was becoming cheaper and bandwidth was becoming more plentiful.
The net result was that HTTP+HTML/XML became the de facto mechanism for programs to communicate with each other over the internet, and things like ONC RPC/CORBA/Java RMI/DCOM died or are dying a slow and increasingly painful death.
But if that's too abstract and confused...
Computer programming languages are another good example.
Programs written in C can be faster and more efficient than programs written in Java. But the time to develop, debug and QA those programs is vastly greater than the time needed for an equivalent program written in Java.
By trading off computational efficiency (memory and CPU resources), programmers are made more efficient, resulting in more software being developed.
I am a storage architect, not a computer science historian...
When Dave Hitz and James Lau architected WAFL they relaxed the fundamental constraint that file systems had preserved for a long time, namely that the client notion of disk layout actually corresponded to the on-disk-layout.
This was a profound decision. By relaxing that constraint, WAFL had taken on the job of providing the illusion to clients that the disk-layout was sequential even if it wasn't.
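A toy model may help show what relaxing that constraint means. The sketch below is emphatically not WAFL's implementation; ToyVolume, block_map and the append-only "disk" are illustrative assumptions of mine. It only shows the general idea of an indirection layer: clients address logical blocks, and the system is free to place the data wherever it likes.

```python
class ToyVolume:
    """Toy model: clients see logical blocks; placement on 'disk' is free."""

    def __init__(self):
        self.disk = []       # physical blocks, in the order they were written
        self.block_map = {}  # logical block number -> physical block number

    def write(self, logical, data):
        # Never overwrite in place: put the data in the next free
        # physical block and repoint the map at it.
        self.disk.append(data)
        self.block_map[logical] = len(self.disk) - 1

    def read(self, logical):
        # The client asks for a logical block; the map hides where it lives.
        return self.disk[self.block_map[logical]]

vol = ToyVolume()
vol.write(2, b"C")
vol.write(0, b"A")
vol.write(1, b"B")
vol.write(0, b"A2")  # an overwrite lands in a new physical block

# The client still sees blocks 0, 1, 2 in order, whatever the disk looks like.
client_view = [vol.read(i) for i in range(3)]
```

Here the physical order is C, A, B, A2, yet the client reads A2, B, C from logical blocks 0 through 2. Maintaining that map costs CPU and memory, which is the trade discussed next.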
At the time that might have appeared to be a ridiculous thing to do. And given the reaction of the other guys (vendors of Traditional Legacy Arrays), it still seems to be a ridiculous thing to do.
But...
If you looked at the computer trends, maybe it wasn't. By exploiting the increasing performance of CPUs and the increasing density of memory, it became possible to write smarter and smarter algorithms that were more and more capable of providing the illusion that the disk layout was sequential even if it wasn't (for more on this topic see an earlier post).
And by relaxing that constraint, it became possible to deliver innovation like thin provisioning, Real Snapshots and deduplication without compromising performance.
Why?
All of these features (primary dedup, Real Snapshots, high performance RAID-6) require an on-disk layout that is different from the client layout. With our internal abstractions at NetApp, these are easy to do and a natural extension of the software. And modifying your system in natural ways is easy only if the new features map easily onto the abstractions you already have.
For Traditional Storage Arrays, fixed in their notions of disk layout, supporting a model where the on-disk layout is different from the client layout challenges the fundamental abstractions they have evolved over the last 15 years. And so they try to create approximations, like pudgy provisioning, but ultimately the extensions look like an ill-fitting suit, because they are.
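To illustrate why snapshots and dedup follow naturally once the on-disk layout is decoupled from the client layout, here is a self-contained toy sketch in Python. It is not a description of how NetApp implements either feature: the class, the hash-based duplicate detection and the snapshot-as-copied-map are all simplifying assumptions of mine.

```python
import hashlib

class ToyVolume:
    """Toy model: a logical-to-physical map with snapshots and dedup."""

    def __init__(self):
        self.disk = []       # physical blocks, never overwritten in place
        self.block_map = {}  # logical block -> physical block (live view)
        self.by_hash = {}    # content hash -> physical block (assumed dedup index)
        self.snapshots = {}  # snapshot name -> frozen copy of block_map

    def write(self, logical, data):
        digest = hashlib.sha256(data).digest()
        physical = self.by_hash.get(digest)
        if physical is None:
            # New content: store it once in the next free physical block.
            self.disk.append(data)
            physical = len(self.disk) - 1
            self.by_hash[digest] = physical
        # Dedup: identical content just shares the same physical block.
        self.block_map[logical] = physical

    def snapshot(self, name):
        # A snapshot is only a copy of the map; no data blocks are copied.
        self.snapshots[name] = dict(self.block_map)

    def read(self, logical, snapshot=None):
        mapping = self.block_map if snapshot is None else self.snapshots[snapshot]
        return self.disk[mapping[logical]]

vol = ToyVolume()
vol.write(0, b"A")
vol.write(1, b"A")        # duplicate content, no new physical block
blocks_after_dedup = len(vol.disk)
vol.snapshot("monday")
vol.write(0, b"B")        # live view changes, snapshot keeps the old block
```

In this model both features are a few lines on top of the map. A system whose abstractions assume that logical block N lives at physical block N has no such map to extend.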

Kostadis --
Great insights! As a former CORBA programmer, I'd say CORBA failed for a multitude of reasons:
First, CORBA was a “design by committee” standard (full of pre-existing proprietary implementations that were shoe-horned into unwieldy “kitchen sink” specifications). Its poor performance, lack of thread support, versioning, security, and language mappings were definitely compounding factors to its demise.
The other reason really has nothing to do with technology at all: Very few programmers could wrap their brain around CORBA. Heck, most programmers don't know preorder traversal from inorder, can’t recognize security exploits (buffer overflows, race conditions, etc.), never heard of functional languages (OCaml, Erlang, etc.), and can't coherently articulate the nuances of message passing, parallelism, and so forth.
In fact, very few are “alpha geeks”. Most aren't even geeks, period – they’re just punching the clock.
I agree. CORBA / ONC RPC / RMI / DCOM are analogous to legacy “Real Fiber Channel” arrays (ha!), while NetApp is “Better than Real Fiber Channel”.
Love it!!
Brian Mitchell
NetApp Tech Lead, Arrow ECS
www.ntapgeek.com
Posted by: Brian Mitchell | May 26, 2009 at 12:29 PM
Brian,
Thanks for the comment!
Yeah, I know about the whole design by committee.
I remember looking at the first draft of the first spec and rolling my eyes in terror...
kostadis
Posted by: kostadis | May 26, 2009 at 03:29 PM