March 08, 2007

Admire and Respect Great Benchmark Results, But Also Be Careful

I'm proud of our new midrange systems, the FAS3040 and FAS3070. Both have benchmark results that blow away the competition. (For detailed results, see this press release on the 3040 and this one on the 3070.)

From this position of strength, I believe it is an excellent time to acknowledge the downsides of benchmarks. Good benchmark results are valuable. High numbers indicate strong hardware and carefully tuned software. Increases within a single architecture (like the 3020 to the 3040) usually indicate real improvement. Still, real-world results can be different from what benchmarks predict, so customers must evaluate performance in other ways as well.

Here's an example. Years ago, NetApp and Sun did a performance bake-off at a large software development company, using their actual application. The results were fascinating. Sun's SPECsfs benchmark result was ten times higher than NetApp's, but on this customer's workload, NetApp was four times faster. The benchmark was wrong by a factor of 40 (10 × 4).

Sun sent in a team of Sales Engineers, and after a week of tuning they doubled the performance, which was still only half as fast as NetApp. Then Sun called in the "big guns". One of their key NFS developers came in, and after another week of tuning, he matched NetApp's performance. The numbers matched, but it was a win for NetApp because we delivered the result on day one with no tuning. The customer appreciated Sun's effort, but said, "Realistically speaking, they aren't going to send those guys out every time I install a new system, so I won't see that performance in my data center."
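For readers who want to check the arithmetic in the bake-off story, here is a minimal sketch. The ratios come straight from the account above; the variable names are purely illustrative, and the only assumption is that both ratios are expressed in the same direction (Sun relative to NetApp) before being compared.

```python
# Ratios from the bake-off story; names are illustrative, not from any tool.
benchmark_sun_over_netapp = 10.0  # SPECsfs: Sun's result was 10x NetApp's
workload_netapp_over_sun = 4.0    # customer workload: NetApp was 4x Sun

# Put both ratios in the same direction (Sun relative to NetApp).
predicted = benchmark_sun_over_netapp      # what the benchmark suggested
observed = 1.0 / workload_netapp_over_sun  # what the customer measured

# How far off the benchmark's prediction was for this workload.
misprediction = predicted / observed

# After the first week of tuning, Sun doubled its untuned performance.
after_week_one = observed * 2

print(f"benchmark off by a factor of {misprediction:.0f}")
print(f"Sun after week one: {after_week_one:.2f}x NetApp")
```

The point of writing it out is that the two numbers multiply rather than add: a benchmark that favors one vendor by 10x while the real workload favors the other by 4x has mispredicted the relative performance by 40x.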

How could a benchmark be so wrong? SPECsfs is actually quite good, but there are two main reasons that benchmarks differ from the real world:
  1. Benchmark configurations don't always match your configuration.
  2. Benchmark workloads don't always match your application workload.
In this case, both were true. Sun had benchmarked an absolutely enormous config, which isn't what the customer got. And the customer's workload was very different from what SPECsfs measures.

Typically after any vendor announces good benchmark results, you'll see a series of he-said-she-said arguments about exactly these issues. Examples from the FAS3070 launch are here and here.

NetApp mitigates the first issue by benchmarking "realistic" configurations. We benchmark commonly purchased hardware with normal features enabled, like Snapshots, RAID-DP, and FlexVols. Even though we test configs that many customers buy, it's not necessarily the config you will buy, so your mileage will still vary.

The fact that benchmark workloads don't match real-life workloads is harder to fix. One approach is to demand a vendor bake-off in your own environment, but few customers have the resources to simulate their full production workload. Alternatively, you can press the vendor for case studies or references from customers like you, running the same application at roughly the same scale.

It's important to follow vendor best practices for your app. We've got folks in our lab who know how to configure EMC systems to run really slowly, and they have folks in their lab who know how to do the same to NetApp. Pay no attention! Focus on results from configurations the vendor recommends. (Do check that the recommended config has the features you plan to use. Many features can hurt performance.)

Benchmarks are valuable, despite some flaws, but you must read between the lines to understand the true message. Are commonly used features enabled? Is data protection turned on? Are LUNs created in unusual ways? One trick I've seen is to create LUNs that span many disks, using just a small sliver of each one, with no RAID protection enabled. Nobody would ever configure a real-world system that way. In other words, poke at how the benchmark config differs from what you plan to buy.

In conclusion, having established my dispassionate honesty by taking the high road and acknowledging that benchmarks aren't perfect, let me summarize by saying: The FAS3040 and the FAS3070 really scream. Check them out!

Comments

I would give away much of the performance for more reliable software. When I see "Panic Message: Protection Fault accessing ..." I do not really care about performance anymore, do you?

Mr. Lightman,

"One trick I've seen is to create LUNs that span many disks, using just a small sliver of each one... "

It is fairly common practice in enterprises to spread IOPS out over multiple spindles with any vendor's array. What isn't common in these environments is to engage in the practice

"...with no RAID protection enabled."

I think that was the key point. =)

When discussing arrays, there are two types of "capacity" to keep in mind, IOPS and GBs (or TBs and PBs).

Benchmarking environments typically deal with IOPS. Vendors are trying to show you how cool their controllers are, and with modern arrays, it takes a lot of spindles to stress any controller in terms of raw IOPS.


Regards,
Max

Nice post, Dave, balanced and unbiased, at least as much as you can expect from one of the founders of NetApp :>)

Regards

Mario Apicella
http://weblog.infoworld.com/thestoragenetwork/

Hey Dave,

I realize you probably don't want to get drawn into EMC's he-said-she-said game (and props for posting the defamatory links), but one of their points seems to ring true. Looking at the SPEC SFS submission, it says NetApp used 224 x 72G disks (16T of delivered capacity), but the result of 60k IOPS indicates that you're only actually using 600G for the test (~4%). In your post you say this:

One trick I've seen is to create LUNs that span many disks, using just a small sliver of each one...

Isn't that exactly the case for your SPEC SFS submission? Granted you're not explicitly slicing up the disks (apparently), but given each disk's utilization the benchmarked configuration seems to exact the equivalent result.

- dlight
