An astute reader of my series would have observed that WAFL is able to achieve its surreal write performance because WAFL does not guarantee that two disk blocks are sequential even if the application writer assumed they were sequential. So for example, if the application thinks that it’s writing two disk blocks in sequential order, they may not appear sequentially on disk.
In many quarters this is perceived as violating an implicit contract between the storage array and the client of the storage array. And in many ways, this is at the heart of the Real vs Better Than Real FiberChannel debate.
The core question is: does this matter?
So let me, right off the bat, say: of course it does. If an application relies on two disk blocks being sequentially located on disk, then it’s assuming a certain kind of performance when it requests the data. If the storage array turns that sequential operation into several random operations, then the application performance will degrade. This is just a fact of life.
Case closed, then. WAFL sucks.
Not so fast there, storage architect.
Let me first observe that traditional storage arrays are able to transform random write operations into sequential operations through clever caching and architectures that rely on global shared memory.
In a similar vein, it is possible to do the same thing for read operations: clever algorithms can make a fragmented on-disk layout appear sequential.
Making this concrete
Suppose we have an application that writes out a file in the following order:
Where the data A was at offset 0, and B was sequentially located after A in terms of offset.
The application then expects that if it operates on the file by sequentially accessing A, B and C, it will get sequential performance.
On the other hand, the application expects that if it operates on the file by accessing A, D, F, the performance will be that of random disk IO.
Last week I showed this picture of how WAFL does write operations:
Now let me show a different picture that shows how disk blocks could get laid out by WAFL.
The light green blocks represent free blocks. The blocks marked P are parity blocks. The blocks that are not green and have the letters A, B, C, D, E, F, G etc. represent allocated blocks. A single allocated file on this set of disks is represented by a sequence of blocks that share the same color. In this example we have four files.
In this example, a sequential read operation would start at a specific row, and then read the blocks in a column starting from that row. A sequential read, for example, would be K1, L1 and M1.
So now if we consider an application that wants to read sequentially, you can imagine how this turns into a sequence of random read operations.
For example, suppose the application wants to read blocks A, B, C and D of the blue file and then wants to read E, F and G. A simplistic implementation would do one random read to read blocks A and B, another random read for block C, and a third random read for D. A clever implementation, however, would recognize that A, B, C and D, although not sequentially located on disk, can be read sequentially if you are willing to skip over uninteresting disk blocks. A clever implementation would also perform read-ahead to read blocks E, F and G, so that the request for the next set of blocks is served from memory rather than from disk, minimizing the total IO to disk.
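To make the difference between the simplistic and the clever reader a little more tangible, here is a toy sketch in Python. It is not WAFL's actual algorithm. The function name plan_reads, the max_gap parameter and the disk addresses are all made up for illustration; the only idea taken from the paragraph above is that nearby blocks can be merged into one sequential sweep if you are willing to read and discard the blocks in between, and that read-ahead blocks can ride along in the same sweep.

```python
def plan_reads(block_addrs, max_gap):
    """Group disk block addresses into sequential runs.

    Two wanted blocks end up in the same run when at most `max_gap`
    unwanted blocks sit between them. Each run is one sequential disk
    operation, returned as an inclusive (start, end) address pair.
    """
    runs = []
    for addr in sorted(block_addrs):
        if runs and addr - runs[-1][1] - 1 <= max_gap:
            runs[-1][1] = addr          # extend the current sweep
        else:
            runs.append([addr, addr])   # start a new (random) read
    return [(start, end) for start, end in runs]


# Hypothetical on-disk addresses for the blocks of one file.
layout = {"A": 10, "B": 11, "C": 14, "D": 17, "E": 18, "F": 20, "G": 21}

requested = ["A", "B", "C", "D"]
read_ahead = ["E", "F", "G"]

# Simplistic reader: only strictly adjacent blocks are merged.
naive = plan_reads([layout[b] for b in requested], max_gap=0)

# Clever reader: tolerates small gaps and fetches the read-ahead blocks
# in the same pass, so the next request is served from memory.
clever = plan_reads([layout[b] for b in requested + read_ahead], max_gap=2)

print(naive)   # three separate reads
print(clever)  # one sweep covering all seven blocks
```

With these made-up addresses the simplistic reader issues three reads (A and B together, then C, then D), while the clever reader covers A through G in a single pass at the cost of reading a few blocks it does not need. Picking a good gap tolerance and read-ahead depth for real workloads is exactly the hard part.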
To be fair, constructing such algorithms that work well in general is a hard problem. As I said earlier, the relatively simple problem of a single processor scheduler, which has been studied for almost 40 years, still produces new insights. This is a relatively new problem and we’re still learning.
But here’s where things work in our favor. All of these algorithms benefit from increasing compute power, because you can spend more time thinking, and from increasing memory, so you can look further into the future. Both of those trends have worked in our favor over the last 15 years.
CPU performance has increased so much faster than disk drive performance that we are able to perform very sophisticated computations before we do any disk operations. Memory capacity has increased dramatically as well, allowing us to store more things in memory before we need to actually perform an operation.
But it’s more than just hardware. It’s also having a brilliant engineering team that has been focused on solving this problem for almost fifteen years. And no amount of hardware can replicate that experience.
But don’t trust me, trust benchmarks that have validated this approach. Alex McDonald in his blog shows how we were able to sustain our performance even in the face of daunting random IO.
But why does this matter, in general?
Fragmentation of the on-disk layout is not just a property of WAFL. It is also a property of things like deduplication, thin provisioning and Real Snapshots. And the reality is that making those things perform well requires the same kind of algorithms and sophistication that you need to make Better Than Real FiberChannel work.
The good news is that we’ve been working on those problems for a long time, which is why our benchmarks always have used snapshots and enabled thin provisioning.
Oh, that might be the reason our competitors don’t have those things enabled and recommend turning them off when you need good performance...

