July 2007

July 27, 2007

Green Data Centers and Solar Villages (Change is Most Likely When Heart and Wallet Align)

The Solar Electric Light Fund (SELF) provides solar lights to poor villages in developing countries. The trouble with solar is that it doesn’t work at night. (D’oh!) Off-grid solar is impractical for most first-world houses because it takes expensive batteries to run 100-watt light bulbs and big TVs. But when the competition is a dim and smoky kerosene lamp, small, cheap batteries work just fine. The payback is surprisingly fast, because villagers already pay $5-10 a month for kerosene. The unexpected result is that solar power today is economically feasible for poor rural villages, but not for first-world homes. Just as some developing countries have gone straight to cell phones, skipping landlines, rural villages may skip the power grid and go straight to solar.

I must be a capitalist at heart: I love that people often want to do the right thing, but I believe that large-scale change is much more likely when supported by good economics. SELF’s approach is so powerful because using solar instead of oil feels like the right thing, but they have improved the odds of success by focusing where there is a positive return on investment (ROI). Once they show the way, their approach should become a virtuous circle, spreading rapidly without more charity. SELF gets the ball rolling, gets a local industry going, and then moves on to the next country.

I believe that a similar dynamic will drive power savings in corporate data centers. In theory, corporations may want to do the right thing by running green data centers, but it’ll be the economic benefits that drive large-scale change. This has been such a hot topic lately that the EPA – under direction from Congress – is about to release a report on data center energy efficiency. In drafting the report, the EPA was interested in hearing what NetApp did to save power in our Sunnyvale data center.

Last year we did a major project to improve data center power efficiency. We increased storage capacity and performance, while achieving these results:

  • 80% reduction in power (329kW to 69kW)
  • 80% reduction in rack space (25 racks to 5.5 racks)
  • 60% improvement in storage utilization (under 40% to about 60%)
  • $1 million direct savings from reduced energy cost and PG&E rebates
  • $1.5 million additional savings expected over 18 months

How? The short answer is that we upgraded to newer, more efficient hardware, and we used advanced features in Data ONTAP 7G to improve storage utilization. (For more details, see this case study, this report, and this blog.)

This was an easy project for us to justify, because it had sales and PR benefits. We were showing our customers how NetApp equipment, properly deployed, can save power. But never mind the sales benefits: the savings alone justify the project. And I haven’t even mentioned the savings from not having to expand our data center. We were approaching full capacity, but now we’ve got space, power, and cooling to spare.

One of my frustrations with capitalism is that – on average – corporations seem much less interested in doing what’s right than individuals are. (Perhaps spreadsheets and PowerPoint presentations somehow inhibit moral behavior. Topic for another blog.) But in this case, I’m confident that the right thing will happen anyway, because the economic benefits are so strong. When projects are green in the wallet sense, as well as the environmental sense, they are much more likely to get funded.

 

July 20, 2007

Lies, Damned Lies, and Benchmark Results (The Ferrari versus The School Bus)

If Mark Twain were alive today, he might revise his famous quote:

There are three kinds of lies: lies, damned lies, and benchmarks.

It’s not that benchmarks are inherently bad, any more than statistics – the subject of Twain’s original quote – are inherently bad, but for both benchmarks and statistics, you need to understand the details pretty well to discern their message. For benchmarks, understanding the interaction between latency and throughput is particularly important.

Good latency (or response time) is like a Ferrari. Never mind how many people it holds, you sure do get to your destination in a hurry.

Good throughput is like a bus. Never mind how fast it goes, you sure can take a lot of kids.

The combination of good latency and good throughput is like a jumbo jet. You don’t always have to choose between speed and capacity.

When reading benchmarks, people often jump to the “big number”, the maximum operations per second. If you care about speed, this is a big mistake! It’s like choosing a racecar based on how many kids it can hold.

I really like how the SPECsfs benchmark reports data. It runs a series of tests at increasing load levels, and measures the response time at each level. Instead of comparing the maximum ops, I focus on how many ops a system can perform at less than 1 millisecond response time. (Ten years ago, I used a 10 ms cutoff. Today that’s too slow to be useful.)

Here are a couple of typical SPECsfs results pages: one for the NetApp FAS3070A and another for the EMC Celerra NS80G. If you simply look at the maximum ops, you get one story. The EMC maxes out at 86,372, and the NetApp at 85,615 – less than 1% difference. But if you look at the peak response time, NetApp (at 2.9 ms) is almost twice as fast as EMC (5.3 ms). That’s a misleading comparison, though, because the EMC really spikes up in the last point, but it’s not too bad up till then. As I said above, I think the best metric is to compare the 1 ms cutoff: 34,277 for EMC and 42,659 for NetApp. For “fast ops”, NetApp has a 25% advantage.

Depending on how you measure, the two systems go from a tie, to EMC almost twice as slow, to NetApp doing 25% more ops. Benchmarks can make your head spin.
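For anyone who wants to check the arithmetic, here is a small sketch that recomputes the three comparisons from the figures quoted above. The dictionary layout and variable names are just for illustration; only the numbers come from the results discussed in this post.

```python
# Figures quoted in the post for the two SPECsfs results.
emc = {"max_ops": 86_372, "peak_ms": 5.3, "ops_at_1ms": 34_277}
netapp = {"max_ops": 85_615, "peak_ms": 2.9, "ops_at_1ms": 42_659}

# Story 1: maximum ops. EMC leads by a fraction of a percent.
max_ops_gap = (emc["max_ops"] - netapp["max_ops"]) / netapp["max_ops"]

# Story 2: response time at peak load. EMC takes nearly twice as long.
peak_latency_ratio = emc["peak_ms"] / netapp["peak_ms"]

# Story 3: ops delivered under the 1 ms cutoff. NetApp leads by about a quarter.
fast_ops_gain = netapp["ops_at_1ms"] / emc["ops_at_1ms"] - 1

print(f"max ops gap:        {max_ops_gap:.1%}")
print(f"peak latency ratio: {peak_latency_ratio:.2f}x")
print(f"fast ops advantage: {fast_ops_gain:.1%}")
```

Same two systems, three different headlines, depending on which line you read.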

The difference is even more extreme if you compare against the BlueArc Titan 2200. The Titan maxes out at 98,131 ops, which is about 15% more ops than the FAS3070, but at the 1 ms cutoff, the FAS3070 does over twice as many ops. The Titan does more ops, but it does them slowly, school bus style. (If you want jumbo jet performance, check out the GX Cluster, which came in at over a million ops max, with over a third of a million at the 1 ms cutoff.)

The lesson is not that benchmarks are bad! The lesson is that to understand benchmarks, you need to understand what matters to you in your particular environment. You can learn lots from a good benchmark, but you must dig deeper than the one big number. (See also this entry.)

[NOTE: SPECsfs doesn’t report the 1 millisecond cutoff directly. I calculate it by linear interpolation between the measured points just below and just above 1 ms. For apples-to-apples comparisons, I used only results for NFSv3 over TCP.]
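The interpolation in the note can be sketched in a few lines of Python. This is only an illustration of the idea: the function name is mine, and the load points are made up for the example, not taken from any real SPECsfs result.

```python
def ops_at_cutoff(points, cutoff_ms=1.0):
    """Estimate ops/sec at a response-time cutoff by linear interpolation.

    points: (ops_per_second, response_time_ms) pairs in order of increasing load.
    Returns None if no adjacent pair of points brackets the cutoff.
    """
    for (ops_lo, rt_lo), (ops_hi, rt_hi) in zip(points, points[1:]):
        if rt_lo <= cutoff_ms < rt_hi:
            # How far the cutoff sits between the two measured response times.
            fraction = (cutoff_ms - rt_lo) / (rt_hi - rt_lo)
            return ops_lo + fraction * (ops_hi - ops_lo)
    return None


# Made-up load points for illustration only (not a real benchmark result).
sample = [(10_000, 0.5), (20_000, 0.7), (30_000, 0.9), (40_000, 1.3), (50_000, 2.5)]
estimate = ops_at_cutoff(sample)
print(f"Estimated ops at 1 ms: {estimate:,.0f}")
```

In this made-up curve the 1 ms line falls a quarter of the way between the 30,000 and 40,000 ops points, so the estimate lands at about 32,500 ops.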

July 16, 2007

Extreme and Surprising Events Happen More Often Than You Think

I just read The Black Swan, by Nassim Taleb. To summarize the whole book in 10 words: Extreme and surprising events happen more often than you think.

Many psychology studies have shown that humans are inherently bad at dealing with improbable events. (See my recent blog on Shark Island.) Taleb talks about this, but to a mathematically minded person like me, his real point is much scarier. He argues that bell curves and standard deviations, the tools that number people use to understand probability, often fail in the real world. With Gaussian bell curves, the probability of extreme events goes down exponentially as you get further from the average. But in many real-world situations, the probability of extreme events goes down much more slowly. The tail is fatter. If you trust standard statistics, you could end up in big trouble.

Taleb uses concrete examples to build intuition. People's height follows a normal bell curve, but wealth does not. Suppose you randomly select 10 people out of the entire world, and check their height. The average will be six feet, or whatever it is. Now take the world's tallest person and add him to the mix. The average only goes up three inches. Increase your random sample to a hundred, and throwing in the tallest person changes things by less than an inch.

Now try the same thing with wealth. Take ten random people worldwide, and their average income is $10,000 or whatever—remember I said worldwide. But add Bill Gates into the mix, and the average goes up many thousand-fold. Even if you increase the sample size to 1000, adding Bill makes the average several hundred times higher—from ten thousand dollars to millions.
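Here is a rough sketch of that thought experiment in Python. The two outlier values are my own assumptions, not numbers from the book: a round 9 feet for the tallest person, and a hypothetical $4 billion a year for the top income, picked only so the results land in the same ballpark as the figures above.

```python
def mean_with_outlier(typical, n, outlier):
    """Average of n people who all sit at the typical value, plus one outlier."""
    return (typical * n + outlier) / (n + 1)


# Assumed outliers (illustrative, not from the book or this post):
TALLEST_IN = 9 * 12            # a round 9 feet, in inches
TOP_INCOME = 4_000_000_000     # hypothetical annual income, in dollars

TYPICAL_HEIGHT_IN = 72         # six feet
TYPICAL_INCOME = 10_000        # "$10,000 or whatever"

# Height: how many inches the average moves when the tallest person joins.
height_shift_10 = mean_with_outlier(TYPICAL_HEIGHT_IN, 10, TALLEST_IN) - TYPICAL_HEIGHT_IN
height_shift_100 = mean_with_outlier(TYPICAL_HEIGHT_IN, 100, TALLEST_IN) - TYPICAL_HEIGHT_IN

# Income: how many times larger the average gets when the top earner joins.
income_ratio_10 = mean_with_outlier(TYPICAL_INCOME, 10, TOP_INCOME) / TYPICAL_INCOME
income_ratio_1000 = mean_with_outlier(TYPICAL_INCOME, 1000, TOP_INCOME) / TYPICAL_INCOME

print(f"height, 10 people:   +{height_shift_10:.1f} inches")
print(f"height, 100 people:  +{height_shift_100:.1f} inches")
print(f"income, 10 people:   {income_ratio_10:,.0f}x")
print(f"income, 1000 people: {income_ratio_1000:,.0f}x")
```

Under these assumptions, one extra person nudges average height by a few inches at most, while one extra earner multiplies average income by hundreds or thousands of times.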

What if height were distributed the way wealth is? Six feet might be the most common height, but there would be many ten-foot people wandering around, and even some hundred-footers. Mathematically speaking, this is the difference between a normal bell curve and a power curve, but an example can show the difference clearly even if you don't care about the math. Consider a town with a million people, and compare how many tall ones there are with a bell curve versus a power curve. (Check the endnote for math-nerd details.)


                        Bell Curve    Power Curve
Count of all people      1,000,000      1,000,000
People over 6'             500,000        500,000
People over 6'3"           158,655        158,655
People over 6'6"            22,750         81,067
People over 7'                  32         34,790
People over 8'                   0         13,143
People over 10'                  0          4,584
People over 100'                 0             27

At first the two curves look similar: half the people are over six feet, and 158,655 are over six-three, exactly the same in both. But at extreme heights, things are very different. With a bell curve, you will never see a ten-foot person. There is some theoretical probability that it might happen, but it's so rare that you can count on never seeing it in your life. With a power curve, there are not only more than 4,500 ten-footers, but 27 one-hundred-footers! For the planet as a whole, the power curve predicts several dozen people over ten thousand feet tall.

With power curves, extreme and surprising events happen more often than you think.

When you are managing probability (or risk), it is obviously very important to understand which curve applies. How would you design hospital beds, for your town of a million, if you knew that hundreds of citizens were over thirty feet tall? What if you thought height was a bell curve, and built eight-foot beds, but it turned out later to be a power curve?

Power curves are very common where big guys can grow at the expense of little guys. Tall people can't take height from short people, but large companies (e.g. Microsoft) can take business from small ones. Popular websites (e.g. Google) take web-hits from small ones. As a result, power curves are very common in business statistics. Power curves can also result when there are many interactions between elements. I suspect that failure events in interconnected infrastructures, like the nation's electric grid or enterprise data centers, follow power curve rules.

Taleb mostly worries about the implications for investors, but I see lessons for companies as well. Don't trust your plans. You still have to make plans, but they will change more than you think. Don't trust statistics on small or medium sized samples unless you know it's a bell curve. If you suspect a sample might be a power curve, don't trust bell curve statistics at all. Expect surprises.

[Math-nerd note: In my example, the average for the bell curve is 6 feet, and the standard deviation is 3 inches. For the power curve, I set 6 feet as the starting point in a half million person population, made 6'3" the first doubling point, and selected the exponent to make the probability match the bell curve at 6'3" for easy comparison. The exponent was 1.656. I ignored the population below 6 feet because power curves don't handle the left side of a peaked distribution well. You obviously don't have bazillions of people six inches tall, or people negative a thousand feet tall, so you have to cut off the left side somehow. I'm only half-good with math, so I probably made some mistakes, but I think it's accurate enough to make the point.]
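For fellow math nerds, the recipe in the note can be sketched in Python. One assumption to flag: the note doesn't spell out the power curve formula, so the scaling below (a variable x that is 1 at 6 feet and grows by 1 for every 3 inches, so that 6'3" is the first doubling point) is my reading of it. It reproduces the table to within a person or so, but treat it as a reconstruction rather than the original spreadsheet.

```python
import math

MEAN_IN = 72.0     # 6 feet, in inches
STEP_IN = 3.0      # bell curve standard deviation; also the assumed power curve step
TOWN = 1_000_000   # town population


def bell_tail(height_in):
    """Expected number of people taller than height_in under the bell curve."""
    z = (height_in - MEAN_IN) / STEP_IN
    return TOWN * 0.5 * math.erfc(z / math.sqrt(2))


# Pick the exponent so both curves give the same count at 6'3".
# The note reports 1.656.
EXPONENT = math.log2(bell_tail(MEAN_IN) / bell_tail(MEAN_IN + STEP_IN))


def power_tail(height_in):
    """Expected number of people taller than height_in under the power curve.

    Assumed form: the half million people over 6 feet, scaled by x ** -EXPONENT,
    where x = 1 at 6 feet, 2 at 6'3", 3 at 6'6", and so on.
    """
    x = 1.0 + (height_in - MEAN_IN) / STEP_IN
    return (TOWN / 2) * x ** -EXPONENT


thresholds = [("6'", 6), ("6'3\"", 6.25), ("6'6\"", 6.5), ("7'", 7),
              ("8'", 8), ("10'", 10), ("100'", 100)]

print(f"exponent: {EXPONENT:.3f}")
for label, feet in thresholds:
    h = feet * 12
    print(f"over {label:>5}: bell {bell_tail(h):>9,.0f}   power {power_tail(h):>9,.0f}")
```

The two tail counts are equal at 6 feet and 6'3" by construction. Everything interesting happens after that, where the bell curve collapses to zero and the power curve just keeps going.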



© NetApp, Inc.  |  "Safe Harbor" Statement