
March 2007

March 29, 2007

Tom Mendoza's "Three Step Management Algorithm" for Solving Any Problem

I became a manager in November of 1999. Before that I'd been a programmer, an architect, a visionary and an evangelist, but I'd never managed people. After that, I became VP of Engineering at NetApp with 250 people reporting to me. We hired another 500 engineers in the next couple years, so I had to learn fast.

There were always lots of problems to deal with, and I sometimes went to Tom Mendoza for advice, since he'd been managing for years. He always asked about the people working on the problem. Who was involved? What other things were they working on? What were they good at?

It was clear that Tom thought about things very differently than me. I would dig in on the problem itself. I'd learn about the details, explore the options, and worry about the right answer. Tom didn't focus on the problem; he focused on the people whose job was to solve the problem. After talking with Tom, I seldom understood the problem any better, but I had lots of ideas about how to move forward in solving it. Occasionally I was the right person to dig in and come up with a solution, but especially as the organization grew, it became obvious to me that Tom's approach was much more powerful and scalable.

Since I'm an engineer at heart, I reverse-engineered the process that Tom seemed to be using as he questioned me. I concluded that Tom applied a simple three-step algorithm to every problem:
  1. Who owns the problem?
  2. Do I trust them?
  3. How do I find an owner I trust?
If you can't find an owner, that may be the problem right there. Skip to step (3). Sometimes it's obvious what person or group owns the problem, and they just need to be reminded that it's theirs and that you are watching.

When you find the owner, the next step is to figure out whether you trust them. I don't mean trust in some abstract sense; I mean trust them to fix this particular problem. In the abstract, I trust Tom Mendoza completely. For a problem involving spreadsheets or programming languages, I don't trust him at all. Even if you trust someone's skills, they may be too busy to do more. Do you trust that they have the skills, the time and the passion to solve this problem?

If steps (1) and (2) fail, then you must find an owner you trust. Sometimes there's someone nearby who can take it on. Other times you may need to reassign someone or hire a new person. Sometimes the answer is "Do it yourself," but the larger the organization, the less often this will be true.

Tom's background is sales, and he knows absolutely nothing about programming, so imagine my surprise when I watched him apply his algorithm recursively. By following steps (1) and (2) he concluded that he needed to hire someone to own a particular problem. "Hire someone" became the new problem, and by following steps (1) and (2) he concluded that he wanted to use an external headhunter. "Find a headhunter" became the new problem, and (here we reach the base case of the recursion) Tom identified the owner as the VP of Human Resources, whom he trusted.
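For the programmers in the audience, here is a rough sketch of that recursion in Python. It is my own illustration, not anything Tom wrote down: the Problem class, its fields, and the resolve function are all hypothetical names, and real trust decisions obviously don't reduce to a boolean.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Problem:
    # All fields are hypothetical modeling choices for this sketch.
    name: str
    owner: Optional[str] = None                # step 1: who owns the problem?
    owner_trusted: bool = False                # step 2: do I trust them for THIS problem?
    next_problem: Optional["Problem"] = None   # step 3: finding an owner becomes a new problem


def resolve(problem: Problem, depth: int = 0) -> List[str]:
    """Return a trace of the decisions made while applying the three steps."""
    indent = "  " * depth
    if problem.owner is not None and problem.owner_trusted:
        # Base case: a trusted owner exists, so the recursion stops here.
        return [f"{indent}{problem.name}: owned by {problem.owner} (trusted)"]
    if problem.next_problem is None:
        # Nobody else can take it on, so the answer is "Do it yourself."
        return [f"{indent}{problem.name}: no trusted owner; do it yourself"]
    step = f"{indent}{problem.name}: no trusted owner, so solve '{problem.next_problem.name}' first"
    return [step] + resolve(problem.next_problem, depth + 1)


# The chain from the story above.
headhunter = Problem("Find a headhunter", owner="VP of Human Resources", owner_trusted=True)
hire = Problem("Hire someone", next_problem=headhunter)
original = Problem("Original problem", next_problem=hire)

for line in resolve(original):
    print(line)
```

The point of the sketch is that the function never looks at the details of the problem itself, only at who owns it and whether that owner is trusted.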

March 22, 2007

Power in the Data Center: To Put a Watt In, I Must Take a Watt Out

It’s interesting talking to customers about power in the data center, because they have such wildly varying perspectives. Some say that power doesn’t really matter at all, and they wonder what the big fuss is. Others say power is the single most critical issue in their data center, and they are surprised that anybody might not agree.

The folks who say power doesn’t matter either haven’t thought about it much, or else they tell me that they’ve done the math, and the cost of keeping their spindles spinning just isn’t that high compared to the cost of buying and managing the storage.

The ones who say power is critical typically can’t put any more power into their data center. In some cases they literally can’t get more power – the power company won’t sell them any more. In other cases, they’ve hit limits on the wiring or the cooling in their data center.

A financial customer in New York explained it best: “We’re at 100% of power capacity today. For every new watt I bring in, I’ve got to figure out how to take one out.” He was very interested in upgrading to new storage systems that consume fewer watts-per-terabyte.

He was also interested in VMware, since that often drives large power savings. (See this blog on how one customer used VMware to reduce power by 450 kW/month.) In most data centers, servers consume more power than storage, so most people start there, but consolidating storage is the obvious next step.

There are many ways that storage companies can help you reduce power, but – surprisingly – more efficient hardware is low on the list. We all take about the same power to keep a spindle spinning, because we all use pretty much the same disks, power supplies, processors and so on.

On the other hand, it takes roughly the same power to run a 144 gigabyte FC drive as a 750 gigabyte ATA drive, so using the largest drives possible is a great way to save. To use ATA drives for mission critical data, you’ll want a RAID that protects against double disk failures. Any feature that improves utilization will also reduce power. Use RAID instead of mirroring. Use thin provisioning. Use clones or snapshots instead of full copies. (For details, check this paper which has point-by-point recommendations for reducing storage power consumption. This paper describes what NetApp’s own IT team did to save power.)
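To make the drive-size point concrete, here is a back-of-the-envelope calculation in Python. The 15 watts per drive is a made-up placeholder (I only claimed above that the two drive types draw roughly the same power), and the sketch ignores RAID overhead, spares, controllers and cooling, so treat the numbers as illustrative only.

```python
import math

# Hypothetical figure: both drive types are assumed to draw about the same power.
WATTS_PER_DRIVE = 15.0


def drives_needed(capacity_tb: float, drive_gb: float) -> int:
    """Raw drive count for a given capacity (1 TB = 1000 GB, no RAID overhead)."""
    return math.ceil(capacity_tb * 1000 / drive_gb)


def watts_per_tb(drive_gb: float, watts: float = WATTS_PER_DRIVE) -> float:
    """Power per terabyte of raw capacity for one drive size."""
    return watts / (drive_gb / 1000)


fc_drives = drives_needed(100, 144)    # 144 GB FC drives for 100 TB
ata_drives = drives_needed(100, 750)   # 750 GB ATA drives for 100 TB
ratio = watts_per_tb(144) / watts_per_tb(750)

print(f"FC:  {fc_drives} drives, {fc_drives * WATTS_PER_DRIVE:.0f} W")
print(f"ATA: {ata_drives} drives, {ata_drives * WATTS_PER_DRIVE:.0f} W")
print(f"FC uses about {ratio:.1f}x the watts per terabyte")
```

Whatever the real per-drive wattage is, it cancels out of the ratio: as long as the two drives draw the same power, the watts-per-terabyte advantage is simply 750/144, or a bit more than five to one.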

To summarize, the biggest savings don’t come from hardware, but from software features that improve storage efficiency and storage utilization.

What I love about all of this is that self-interest actually drives customers to a greener data center. One of my frustrations with corporations is that economics often seem to trump “good citizenship”, so I love it when economics actually drive companies to do the right thing.

March 16, 2007

Analyst Day Vision Themes: "Application Integration" and "Smart Copies"

We had our annual analyst day in New York this week. That's when we bring in several hundred financial analysts and industry analysts and share our progress, vision and strategy. My focus was on our vision for the future—how we can direct innovation in a way that matters to our customers. I had two main themes.

The first theme was Application Integration. CIOs care much more about the applications that run their business than they do about their storage, so the best way to be relevant to the CIO is to provide the best possible data management environment for their apps. To put it another way: If the application is King, how can we make the King look good? You'll have a pretty good sense of what I talked about if you read blogs like Booth Duty at Oracle Open World: FlexClone is the Big Hit, Using Simple Pictures to Control Data Protection Policies, and Data Management and Automated Teller Machines.

The second theme was Smart Copies. This is a new layer of storage that customers create when they make a second copy of their data—usually as part of a disk-to-disk-to-tape backup scheme—and then use features like snapshots and cloning to get more business value from the second copy. Examples include long-term archives for compliance or clones to accelerate test and development for SAP and Oracle. Many of our customers are starting to create a "smart copy infrastructure" containing copies of almost everything in their primary storage.

Part of what makes our copies "smart" is that we make them so easy to create. For business continuance with mission critical data, create a synchronous copy that exactly replicates your primary storage. For less critical data, save money by putting the copy on inexpensive ATA drives, and by updating the copy at night when bandwidth is cheaper. Or update once an hour if you want. It's completely flexible.

We often brag that our unified architecture makes it easy for customers to choose between SAN, NAS and iSCSI, depending on what's best for the app, but I think it's equally important to offer a wide variety of data protection capabilities. With most storage arrays, the only option for replication is from one storage system to another that's just the same (like-to-like). With NetApp, it's easy to replicate from high-end SAN with 72 GB Fibre Channel drives to a much less expensive iSCSI system with 750 GB ATA drives. We even have tools (see here and here) to bring data from other vendors' primary storage into our smart copy infrastructure.

The other thing that makes our copies "smart" is features like snapshots and cloning. Backup or DR may be the reason you created the copy, but once you have the copy you can clone it to let more people access the data. A clone is a "virtual copy" that takes very little space, so it's fast and easy to create as many clones as you want. Or you can use snapshots to keep data for a long time. For compliance, make the snapshots tamperproof to ensure they can't be changed.

Clones are especially valuable for development and test in SAP and Oracle environments. Clones speed up test and dev in two ways. First, clones speed up the test cycle itself. Copying a multi-terabyte database is slow, but creating a new clone is instant. You can quickly create a clone, run a test, and check the result. If the test fails, fix the bug and try again. Second, you can afford to create lots of clones, since they don't take any extra space until you write to them. Real copies of a big database are expensive, so people have to share. With clones, it's cheap to create a copy for everyone. People are faster and more efficient when they can work in parallel.
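If you want a feel for why clones are cheap, here is a toy copy-on-write model in Python. It is only a sketch of the general idea, not how FlexClone is actually implemented: the Snapshot and Clone classes and the block-number dictionary are all invented for illustration.

```python
from types import MappingProxyType


class Snapshot:
    """Toy read-only image of a volume: block number -> data."""

    def __init__(self, blocks):
        # MappingProxyType gives a read-only view, so the snapshot cannot change.
        self.blocks = MappingProxyType(dict(blocks))


class Clone:
    """Toy copy-on-write clone: shares the snapshot's blocks until it writes."""

    def __init__(self, parent: Snapshot):
        self.parent = parent
        self.own_blocks = {}  # holds only blocks written after the clone was created

    def read(self, n):
        # Prefer the clone's private copy; otherwise fall through to the shared block.
        if n in self.own_blocks:
            return self.own_blocks[n]
        return self.parent.blocks[n]

    def write(self, n, data):
        # The first write to a block is the only point where extra space is used.
        self.own_blocks[n] = data

    def extra_blocks(self) -> int:
        return len(self.own_blocks)


database = Snapshot({0: "header", 1: "customers", 2: "orders"})
clones = [Clone(database) for _ in range(5)]   # one clone per developer

clones[0].write(1, "customers-patched")        # one developer tests a fix

print(clones[0].read(1))                       # sees the private change
print(clones[1].read(1))                       # everyone else still sees the original
print(sum(c.extra_blocks() for c in clones))   # total extra blocks across all clones
```

Creating a clone copies nothing, which is why it is instant, and five clones with one changed block cost one block of extra space rather than five full copies.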

In a way, you could say that application integration is all about making the first copy of data better (primary storage), and smart copies are all about making the additional copies better (secondary storage). When you put the two together, you get a very powerful model of data management.

March 08, 2007

Admire and Respect Great Benchmark Results, But Also Be Careful

I'm proud of our new midrange systems, the FAS3040 and FAS3070. Both have benchmark results that blow away the competition. (For detailed results, see this press release on the 3040 and this one on the 3070.)

From this position of strength, I believe it is an excellent time to acknowledge the downsides of benchmarks. Good benchmark results are valuable. High numbers indicate strong hardware and carefully tuned software. Increases within a single architecture (like the 3020 to the 3040) usually indicate real improvement. Still, real-world results can be different from what benchmarks predict, so customers must evaluate performance in other ways as well.

Here's an example. Years ago, NetApp and Sun did a performance bake-off at a large software development company, using their actual application. The results were fascinating. The SPECsfs benchmark result for Sun was ten times faster, but for this customer's workload, NetApp was four times faster. The benchmark was wrong by a factor of 40.

Sun sent in a team of Sales Engineers, and after a week of tuning they doubled the performance, which was still half as fast as NetApp. Then Sun called in the "big guns". One of their key NFS developers came in, and after another week of tuning, he matched NetApp's performance. The numbers matched, but it was a win for NetApp because we delivered the result on day one with no tuning. The customer appreciated Sun's effort, but said, "Realistically speaking, they aren't going to send those guys out every time I install a new system, so I won't see that performance in my data center."
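Here is the arithmetic behind those numbers, written out in Python. I'm reading "ten times faster" as Sun's SPECsfs result being ten times NetApp's, and normalizing everything to NetApp = 1.0; the variable names are just for this sketch.

```python
# Everything is expressed relative to NetApp, so NetApp = 1.0 throughout.
benchmark_sun = 10.0          # SPECsfs: Sun ten times faster than NetApp
workload_sun = 1.0 / 4.0      # customer workload: NetApp four times faster than Sun

# How far the benchmark's prediction was from what the customer measured.
benchmark_error = benchmark_sun / workload_sun

after_sales_engineers = workload_sun * 2   # a week of tuning doubled Sun's result
after_nfs_developer = 1.0                  # another week of tuning: matched NetApp

print(f"Benchmark off by a factor of {benchmark_error:.0f}")
print(f"After week one, Sun ran at {after_sales_engineers:.2f} of NetApp's speed")
print(f"After week two, Sun ran at {after_nfs_developer:.2f} of NetApp's speed")
```

The factor of 40 is just the two ratios multiplied together: the benchmark said Sun was ten times ahead, and the real workload put Sun four times behind.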

How could a benchmark be so wrong? SPECsfs is actually quite good, but there are two main reasons that benchmarks differ from the real world:
  1. Benchmark configurations don't always match your configuration.
  2. Benchmark workloads don't always match your application workload.
In this case, both were true. Sun had benchmarked an absolutely enormous config, which isn't what the customer got. And the customer's workload was very different from what SPECsfs measures.

Typically after any vendor announces good benchmark results, you'll see a series of he-said-she-said arguments about exactly these issues. Examples from the FAS3070 launch are here and here.

NetApp mitigates the first issue by benchmarking "realistic" configurations. We benchmark commonly-purchased hardware with normal features enabled, like Snapshots, RAID-DP, and FlexVols. Even though we test configs that many customers buy, it's not necessarily the config you will buy, so your mileage will still vary.

The fact that benchmark workloads don't match real-life workloads is harder to fix. One approach is to demand a vendor bake-off in your own environment, but few customers have the resources to simulate their full production workload. Alternately, you can press the vendor for case studies or references from customers like you, running the same application at roughly the same scale.

It's important to follow vendor best practices for your app. We've got folks in our lab who know how to configure EMC systems to run really slow, and they have folks in their lab who know the same for NetApp. Pay no attention! Focus on results from configurations the vendor recommends. (Do check that the recommended config has the features you plan to use. Many features can hurt performance.)

Benchmarks are valuable, despite some flaws, but you must read between the lines to understand the true message. Are commonly used features enabled? Is data protection turned on? Are LUNs created in unusual ways? One trick I've seen is to create LUNs that span many disks, using just a small sliver of each one, with no RAID protection enabled. Nobody would ever configure a real-world system that way. In other words, poke at how the benchmark config differs from what you plan to buy.

In conclusion, having established my dispassionate honesty by taking the high road and acknowledging that benchmarks aren't perfect, let me summarize by saying: The FAS3040 and the FAS3070 really scream. Check them out!

March 02, 2007

Using Simple Pictures to Control Data Protection Policies

In Data Management and Automated Teller Machines, I described a vision of data management. The gist was that application administrators ought to be able to provision and manage data themselves, without bothering a storage admin, just as I can get cash from an ATM myself, without waiting for a bank teller.

ATMs are only safe because banks have policies that detect problems and determine how much cash I can withdraw at a given point in time. Likewise, our ATM vision of data management requires tools to let storage admins easily define data management policies.

Our new Protection Manager focuses on policies for data protection. A policy is a rule that describes how to protect the data. The idea is to let storage admins reflect the corporate rules, guidelines or SLAs (service level agreements) independent of specific NetApp technology. A policy can say "make copies every week and keep them for at least a year" or "retain undeletable copies for seven years." Our automation engine evaluates which technologies are available (has the customer licensed SnapVault? SnapMirror? SnapLock?) and connects the plumbing in a way that satisfies the policy's goals. Over time, the engine monitors whether the data conforms to the policy's goals. The key point is that you can tell the Protection Manager your goals and let it figure out the details.

Protection Manager lets you define policies in a graphical, intuitive way. A simple picture represents the policy. An icon on the left side represents the primary storage, and one or more icons on the right represent copies of the data. Arrows between primary and copy show the type of copy. Click the diagram to edit how and when the transfers should happen. Should a mirror update once an hour, or just at midnight? Is the backup window open all day, or only at night? How many primary copies should be retained and how many backup copies? The tool isn't just about backups and snapshots. Our plan is to also support the undeletable and unalterable copies required to comply with government regulations.

After you have defined exactly how the policy works, you can give it a name. Maybe "Gold" means an offsite mirrored copy updated throughout the day plus a year's worth of backup copies, "Bronze" means one backup a day at midnight kept for just one week, and "SEC-17A" means unalterable and undeletable copies kept for 7 years.

You can apply a policy to a single volume or LUN, but you can also apply it to a user-defined group called a dataset. If you have a large number of LUNs that all support the same application, you can group them together in a dataset and apply the policy to the dataset as a whole.

The idea is that instead of worrying about hundreds or thousands of mirroring relationships for hundreds or thousands of LUNs and volumes, you can define a handful of policies, group your data into a much smaller number of datasets, and give each dataset the appropriate policy. Another benefit is that defining standard policies makes it easier to deliver storage broadly as a service within a company. Formalized policies lay the foundation for consistent execution and predictability.
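Here is a small Python sketch of the policy-and-dataset idea. To be clear, this is not the Protection Manager API: the Policy and Dataset classes, the effective_policies function, and the LUN and volume names are all hypothetical, and the policy descriptions just echo the "Gold" and "Bronze" examples above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Policy:
    """A named protection goal, independent of any specific technology."""
    name: str
    description: str


@dataclass
class Dataset:
    """A user-defined group of volumes or LUNs that share one policy."""
    name: str
    policy: Policy
    members: List[str] = field(default_factory=list)


def effective_policies(datasets: List[Dataset]) -> Dict[str, str]:
    """Map every volume or LUN to the name of the policy its dataset carries."""
    return {member: ds.policy.name for ds in datasets for member in ds.members}


gold = Policy("Gold", "offsite mirror updated through the day, plus a year of backups")
bronze = Policy("Bronze", "one backup a day at midnight, kept for one week")

# Hypothetical LUN and volume names.
erp = Dataset("erp", gold, [f"erp_lun_{i}" for i in range(300)])
scratch = Dataset("scratch", bronze, ["scratch_vol_0", "scratch_vol_1"])

assignments = effective_policies([erp, scratch])
print(len(assignments), "volumes and LUNs covered by 2 policy assignments")
print(assignments["erp_lun_42"], assignments["scratch_vol_1"])
```

The admin makes two decisions (one per dataset) instead of 302, and changing what "Gold" means later is a change in one place.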

We don't yet allow application admins to set protection policies on their own, but that is the next step. Our plan is to add these features to our own application integration tools, like SnapManager for Oracle, but we understand that not everyone uses those tools, so we are also offering APIs so that we can incorporate these capabilities into frameworks like Oracle Fusion, Microsoft .NET, or SAP NetWeaver.

We haven't yet achieved the full vision—to be honest not even close—but I think we are ahead of most vendors. Others have talked about this kind of model for data management, but we have a big advantage because we have a unified architecture that spans our whole product line: primary to secondary, high-end to low-end, and SAN to NAS to iSCSI. Our storage management team can focus on cool new features instead of on how to make incompatible architectures—like DMX, Clariion, Centera and Celerra—look more or less the same.

© NetApp, Inc.