Random Rocks And Benchmarks
“People understand contests. You take a bunch of kids throwing rocks at random and people look askance, but if you go and hold a rock-throwing contest -- people understand that.” (Don Murray)
And that, in a nutshell, is the origin of most competitive sports. What starts as "Hey, I can chuck this boulder through that window over there -- betcha you can't!" acquires over time a set of rules, referees, measuring devices and a governing body (sometimes two or three). The sport of rock-chucking may even get widely recognised, widely played, and make it into the Olympics. Like spear throwing (the javelin) or blasting birds out of the sky with a shotgun (clay pigeon shooting), rock-chucking has matured into shot-putting.
Just like benchmarks. Originally used to demonstrate some feature of your system in all its one-trick glory (HP IOPS from cache is a classic example of how to do this), modern industry benchmarks attempt to allow some real-world comparison between competitors. The relationship to real life is debatable, but the rules by which they operate are carefully designed to inject some realism into comparative claims.
Benchmarks are intended to be a repeatable test of a set of skills. But, as with all competitive sports, sometimes there's the kid who starts his benchmarking career hanging round on street corners stoning passers-by for entertainment, or hoisting up the nearest large rock at hand and heaving it through the closest plate-glass window.
And the new kid is over at HP, using a non-benchmark as a benchmark, and generally lobbing rocks around in all directions. I'm not going to dissect the post in detail, because others have and will continue to do that. There's one little paragraph I want to focus on, because it demonstrates one of my pet peeves: benchmark intuition.
Now things were starting to make sense. We were seeing the same sort of decay curve as shown in the IOMeter results posted in Making Sense of WAFL - Part 4. Every time the test is run, the random component of the Jetstress database accesses fragment the LUN further and the throughput numbers get worse. An array like EMC CX or HP EVA wont undergo this sort of decay curve since these arrays do not have internal WAFL-fragmentation problems like the FAS does.
The non-benchmark is Microsoft's ESRP. The tester's first intuitive assumption is that WAFL fragments; his second, following from the first, is that this fragmentation is the source of the diminishing throughput numbers.
Bzzzt. Big fail.
Let's allow the first intuition; let's allow, for the sake of this demonstration, that "WAFL fragments your data". Here's a simple example to demonstrate why his intuition is wrong on fragmentation being the source of the problem. Exchange 2007 generates small random IOs (and that's the load JetStress, which HP are using in their test, puts on the storage). The table below has five columns, and shows why small random IO works just as well (or badly, depending on your take) on randomly laid out data as on sequentially laid out data.
- Random Placement: I've placed 100 blocks randomly. The numbers have been generated from www.random.org. Slot 1 is block 67, slot 2 is block 19 and so on.
- Random Requested Block: this is meant to simulate IO requests from Exchange; again, drawn from a different run from www.random.org.
- Matching Block (Random Placement): this is the slot where the requested block actually lives. So asking for block 80 requires a visit to slot 22, and so on.
- Seek Distance (Random Placement): this is the effective seek distance between requested random blocks. After we visit slot 22 (for block 80), we need to visit slot 58 (for block 37), requiring a seek of 36 slots.
- Seek Distance (Sequential Placement): this is the effective seek distance between the same requested blocks when the data is laid out sequentially, so that block n lives in slot n. After we visit slot 80 (for block 80), we need to visit slot 37 (for block 37), requiring a seek of 43 slots.
(I'm having difficulty uploading the spreadsheet to TypePad, so when I get it fixed you'll be able to "Click to download the whole spreadsheet". Not yet though.)
| Random Placement | Random Requested Block | Matching Block (Random Placement) | Seek Distance (Random Placement) | Seek Distance (Sequential Placement) |
| 67 | 80 | 22 | 0 | 0 |
| 19 | 37 | 58 | 36 | 43 |
| 75 | 18 | 61 | 3 | 19 |
| 23 | 26 | 53 | 8 | 8 |
| 85 | 57 | 63 | 10 | 31 |
| 59 | 100 | 14 | 49 | 43 |
| 14 | 59 | 6 | 8 | 41 |
| ... | ... | ... | ... | ... |
| SUM | | | 3269 | 3322 |
Hey, look at that! The sequentially laid out data takes more slot seeks than the randomly laid out data! Try it yourself: replace the 100 numbers in the first two columns with random numbers, and check the seek distance sums.
On average, they will be equal. In fact, if the requested blocks are random, it doesn't matter how the data is laid out. Intuition fail.
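If you'd rather not wait for the spreadsheet, the same experiment can be sketched in a few lines of Python. This is only a rough stand-in for the table, under some loudly stated assumptions: the names `seek_sum` and `simulate` are made up for this sketch, the numbers come from Python's `random` module rather than www.random.org, and the requested blocks are assumed to be drawn independently with repeats allowed (the spreadsheet may or may not have done it that way).

```python
import random


def seek_sum(slots):
    """Total distance travelled when visiting the given slots in order."""
    return sum(abs(b - a) for a, b in zip(slots, slots[1:]))


def simulate(n_blocks=100, n_requests=100, seed=None):
    """Return (random-placement seek sum, sequential-placement seek sum)."""
    rng = random.Random(seed)

    # Random placement: placement[slot - 1] is the block stored in that slot.
    placement = list(range(1, n_blocks + 1))
    rng.shuffle(placement)

    # Inverse lookup: block number -> slot where that block lives.
    slot_of = {block: slot for slot, block in enumerate(placement, start=1)}

    # Random requests. ASSUMPTION: independent uniform draws, repeats allowed.
    requests = [rng.randint(1, n_blocks) for _ in range(n_requests)]

    random_sum = seek_sum([slot_of[block] for block in requests])
    # Sequential placement: block n lives in slot n, so the slots visited
    # are simply the requested block numbers.
    sequential_sum = seek_sum(requests)
    return random_sum, sequential_sum


if __name__ == "__main__":
    trials = 2000
    results = [simulate(seed=s) for s in range(trials)]
    mean_random = sum(r for r, _ in results) / trials
    mean_sequential = sum(s for _, s in results) / trials
    print(f"mean seek sum, random placement:     {mean_random:.1f}")
    print(f"mean seek sum, sequential placement: {mean_sequential:.1f}")
```

As a sanity check against the table, `seek_sum([22, 58, 61, 53, 63, 14, 6])` gives 114 and `seek_sum([80, 37, 18, 26, 57, 100, 59])` gives 185, which match the first seven rows of the two seek distance columns. Any single run can favour either layout, as the table does; it's only when you average over many runs that the two sums should settle close to each other.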
Here's the professional sport of benchmarking, which HP don't take part in (still being, as it were, at the rock-throwing stage):
- NetApp SPC Benchmarks (no HP here)
- SFS2008 NFS benchmarks (HP missing again)
And, of course, the official ESRP results (and, just as a reminder, these aren't benchmarks)
Having failed at shot-putting, perhaps HP might want to pick another sport for their talented testers. Like nude football.
[updated to correct some borked links and a typo].


