Lies, Damned Lies, and Benchmark Results (The Ferrari versus The School Bus)
If Mark Twain were alive today, he might revise his famous quote:
There are three kinds of lies: lies, damned lies, and benchmarks.
It’s not that benchmarks are inherently bad, any more than statistics – the subject of Twain’s original quote – are inherently bad, but for both benchmarks and statistics, you need to understand the details pretty well to discern their message. For benchmarks, understanding the interaction between latency and throughput is particularly important.
Good latency (or response time) is like a Ferrari. Never mind how many people it holds, you sure do get to your destination in a hurry.
Good throughput is like a bus. Never mind how fast it goes, you sure can take a lot of kids.
The combination of good latency and good throughput is like a jumbo jet. You don’t always have to choose between speed and capacity.
When reading benchmarks, people often jump to the “big number”, the maximum operations per second. If you care about speed, this is a big mistake! It’s like choosing a racecar based on how many kids it can hold.
I really like how the SPECsfs benchmark reports data. It runs a series of tests at increasing load levels, and measures the response time at each level. Instead of comparing the maximum ops, I focus on how many ops a system can perform at less than 1 millisecond response time. (Ten years ago, I used a 10 ms cutoff. Today that’s too slow to be useful.)
Here are a couple of typical SPECsfs results pages: one for the NetApp FAS3070A and another for the EMC Celerra NS80G. If you simply look at the maximum ops, you get one story. The EMC maxes out at 86,372, and the NetApp at 85,615 – less than 1% difference. But if you look at the peak response time, NetApp (at 2.9 ms) is almost twice as fast as EMC (5.3 ms). That’s a misleading comparison, though, because the EMC really spikes up in the last point, but it’s not too bad up till then. As I said above, I think the best metric is to compare the 1 ms cutoff: 34,277 for EMC and 42,659 for NetApp. For “fast ops”, NetApp has a 25% advantage.
Depending on how you measure, the two systems go from a tie, to EMC being almost twice as slow, to NetApp doing 25% more ops. Benchmarks can make your head spin.
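For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the three comparisons from the figures quoted above. The dictionary keys and variable names are my own labels for illustration, not fields from the SPECsfs reports.

```python
# Figures quoted in the post; the key names are illustrative labels only.
emc = {"max_ops": 86372, "peak_ms": 5.3, "ops_at_1ms": 34277}
netapp = {"max_ops": 85615, "peak_ms": 2.9, "ops_at_1ms": 42659}

# Story 1: maximum ops (how far ahead EMC is, relative to NetApp).
max_ops_gap = (emc["max_ops"] - netapp["max_ops"]) / netapp["max_ops"]

# Story 2: response time at the last (peak) load point.
peak_ratio = emc["peak_ms"] / netapp["peak_ms"]

# Story 3: ops delivered under the 1 ms cutoff (NetApp relative to EMC).
fast_ops_gain = netapp["ops_at_1ms"] / emc["ops_at_1ms"] - 1

print(f"max ops gap:        {max_ops_gap:.1%}")    # under 1%
print(f"peak latency ratio: {peak_ratio:.2f}x")    # almost 2x
print(f"fast ops advantage: {fast_ops_gain:.1%}")  # roughly 25%
```

Same two systems, same published data, three different headlines depending on which line you read.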
The difference is even more extreme if you compare against the BlueArc Titan 2200. The Titan maxes out at 98,131 ops, which is about 15% more ops than the FAS3070, but at the 1 ms cutoff, the FAS3070 does over twice as many ops. The Titan does more ops, but it does them slowly, school bus style. (If you want jumbo jet performance, check out the GX Cluster, which came in at over a million ops max, with over a third of a million at the 1 ms cutoff.)
The lesson is not that benchmarks are bad! The lesson is that to understand benchmarks, you need to understand what matters to you – matters for your particular environment. You can learn lots from a good benchmark, but you must dig deeper than the one big number. (See also this entry.)
[NOTE: SPECsfs doesn’t report the 1 millisecond cutoff directly. I calculate it based on a linear interpolation of the point just above and just below. For apples-to-apples comparisons, I used only results for NFSv3 over TCP.]
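The interpolation described in the note can be sketched roughly as follows. This is only an illustration of the idea: the function name is hypothetical, and the sample curve is made up so the arithmetic is easy to follow. It is not taken from any published SPECsfs result.

```python
def ops_at_cutoff(points, cutoff_ms=1.0):
    """Estimate the ops/sec at which response time crosses cutoff_ms.

    points: (ops_per_sec, response_ms) pairs, sorted by increasing load.
    Returns None if the curve never crosses the cutoff from below.
    """
    for (ops_lo, rt_lo), (ops_hi, rt_hi) in zip(points, points[1:]):
        if rt_lo <= cutoff_ms < rt_hi:
            # Linear interpolation between the point just below
            # and the point just above the cutoff.
            frac = (cutoff_ms - rt_lo) / (rt_hi - rt_lo)
            return ops_lo + frac * (ops_hi - ops_lo)
    return None


# Made-up load curve for illustration only (not real benchmark data).
curve = [(10000, 0.5), (20000, 0.625), (30000, 0.75),
         (40000, 1.25), (50000, 2.5)]

print(ops_at_cutoff(curve))  # 35000.0
```

In the made-up curve, the 1 ms line falls halfway between the 0.75 ms and 1.25 ms points, so the estimate lands halfway between 30,000 and 40,000 ops.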




Your comments about the storage industry are very refreshing. But please have a look at the source code of this site: http://now.netapp.com, you will notice a href to http://now-devel.netapp.com/images/trans_spacer.gif.
I truly hope that the WAFL code is better!
-- Martin Mueller
--------------------------------------------------------------
Thanks for the bug report Martin! I sent it to the NOW web team and they tell me it's fixed now. They are looking into what went wrong, and how to stop it from happening again.
And yes, I hope the WAFL code is better too. As you might imagine, our test process for ONTAP releases is somewhat different than for web pages.
-- Dave Hitz
Posted by: Martin Mueller | July 20, 2007 at 05:50 PM
Wow, you are comparing a mistake in HTML of a web page with a rock solid file system with years of real-world usage, and powering HUGE stores of data around the world... I truly hope you were joking.
Posted by: Steven James | July 21, 2007 at 06:56 PM
I agree. I seriously doubt that a company the size of NetApp has their WAFL software engineers doing double duty as HTML developers. But speaking of misleading benchmarks, I ran www.netapp.com through the W3C web validator http://validator.w3.org/ and found that it failed with 78 errors. I also ran www.emc.com through the same validator http://validator.w3.org/ and it too failed with 357 errors.
Following Martin's logic, NetApp's WAFL engineers are much better than EMC's.
Too funny!!!
Posted by: Mike Jones | July 21, 2007 at 11:29 PM
SPEC benchmarks are very generic tests of one predefined set of performance heat runs.
They do not reflect every customer environment, and they also do not reflect how systems behave with millions of files, where most systems fail.
Comparing the 1ms cutoff doesn't buy any customer anything in your EMC example.
In addition comparing a 3070 cluster with a single Titan head is also inappropriate.
-- Benchmark reader
-----------------------------------------------------------
I chose the single Titan head because I wanted to compare systems with roughly similar maximum ops. Comparing the 3070 against the 2-node Titan is also interesting. In that case, what you see is that the 3070 has 30% more ops at the 1-ms cutoff, even though the 2-node Titan has over DOUBLE the maximum ops.
I disagree that the 1ms cutoff doesn't buy any customer anything. For I/O bound applications, storage latency is the key limiter to performance. If the storage can respond in 1ms, then you get 1000 responses per second to a single thread. If the storage responds in 2ms, then you get 500. The difference between 1ms and 2ms seems small, but it cuts the performance of your application in half. Even with a 2ms cutoff, less than half of the Titan's ops are usable. For the 3070, 86% of the ops come in below the 2ms cutoff.
Of course, if you have a bazillion threads, each doing only occasional I/O, then the difference between 1ms and 2ms probably doesn't matter. I won't argue that a server that can do lots of operations slowly is never useful, just that it often isn't.
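The single-thread arithmetic above can be written out as a tiny sketch. It assumes one thread that issues an operation, waits for the reply, and only then issues the next one; the function name is made up for illustration.

```python
def max_ops_per_thread(latency_ms):
    """Upper bound on ops/sec for one thread that waits for each reply
    before sending the next request (assumed fully synchronous)."""
    return 1000.0 / latency_ms


for latency_ms in (1.0, 2.0):
    print(f"{latency_ms} ms -> {max_ops_per_thread(latency_ms):.0f} ops/sec")
```

Doubling the latency halves the ceiling for that thread, no matter how many total ops the storage system could deliver to other clients.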
By the way, I completely agree with your point that SPECsfs doesn't reflect every customer environment, or systems with millions of files. I like SPECsfs, but it certainly isn't perfect.
-- Dave Hitz
Posted by: Benchmark reader | July 23, 2007 at 11:21 AM
@mike jones:
Just how many WAFL engineers does EMC have? :)
Posted by: TimC | July 23, 2007 at 06:12 PM
As an aside, shouldn't the SPECsfs numbers take into account the price of the system too, the way the TPC benchmarks do? One can design an expensive system (say, by doing a bunch of stuff in hardware) and report huge numbers, but I would think one would want to look at price/performance also when comparing systems.
-- Aalop Shah
--------------------------------------------------------------
Aalop,
I like the idea of including system costs. For most customers, price/performance is at least as important as overall performance. For some reason, SPEC has never managed to overcome vendor objections to including price data. I don't know the details.
Even if you got the list price, you still wouldn't have a full apples-to-apples comparison, because different companies have such different discounting policies. But I have to agree that it'd be better than nothing.
-- Dave Hitz
Posted by: Aalop Shah | July 24, 2007 at 07:14 AM
At least EMC stopped using RAID 0 for these kinds of benchmarks.
Still, two CX3-80s as a backend is not a real-world "customer purchasable" configuration, especially compared to a FAS3070C (4 storage processors vs. 2 storage processors).
SPEC benchmarks should also have a price/IOPS indication for the proposed configuration.
Posted by: Stefano Pirovano | July 25, 2007 at 09:23 AM
I find it interesting that you push people to look at the SPECsfs testing. Those results are not apples to apples comparisons to say the least.
Let's start out by looking at the number of spindles behind these tests, since in the end spindle speed is really the bottleneck in most tests. BlueArc was using 200 disks, NetApp was using 224, and EMC was using 300. So based on that alone I say kudos to NetApp: on the surface you are not only quicker in terms of latency, but you are also using fewer disks than EMC.
However that is just one layer below the surface of this benchmark.
The next layer to look at is load generators. It took NetApp almost 2x as many load machines to come up with those numbers.
The last layer I look at is the most telling of how to beat a benchmark. I am calling you out on it. NetApp was the ONLY vendor to run against 2 separate file systems. Both EMC and BlueArc ran a single file system and single name space.
I look forward to your response.
Posted by: Steven Schwartz | September 03, 2007 at 12:09 PM
Benchmarking can be very misleading and you stated this in your opening:
There are three kinds of lies: lies, damned lies, and benchmarks.
You can never run an accurate benchmark test, even with spec-for-spec identical storage arrays and infrastructure, mainly because you cannot reproduce those conditions in the real world. In the real world no two customers are the same, so a benchmark can only be used as a guide.
Storage vendors, no matter who, will only use benchmark results if the results are positive for their product or technology. I'm sure EMC or Titan could release benchmark results that show their products to be superior to NetApp's.
There's only one way to run a proper benchmark: run the same-spec arrays from NetApp and EMC, for example, across 100 or more different customer environments (the more the better) and use the statistical results as a real-world guide. Most importantly, such tests must be run independently, with no influence from any storage vendor at all.
So thank you very much for your candid report. It's very informative, but as you stated:
There are three kinds of lies: lies, damned lies, and benchmarks.
I will not use the results you provided when designing and architecting storage and infrastructure solutions for my customers.
Posted by: Silver D. | September 04, 2007 at 07:20 AM