September 27, 2009

Random Rocks And Benchmarks

“People understand contests. You take a bunch of kids throwing rocks at random and people look askance, but if you go and hold a rock-throwing contest -- people understand that.” (Don Murray)

And that, in a nutshell, is the origin of most competitive sports. What starts as "Hey, I can chuck this boulder through that window over there -- betcha you can't!" develops over time a set of rules, referees, measuring devices, a governing body (sometimes two or three), and the sport of rock-chucking may even get recognised by many, played by many and make it into the Olympics. Like spear throwing (the javelin) or blasting birds out of the sky with a shotgun (clay pigeon shooting), rock-chucking has matured into shot-putting.

Just like benchmarks. Originally used to demonstrate some feature of your system in all its one-trick glory (HP IOPS from cache is a classic example of how to do this), modern industry benchmarks make the attempt to allow some real-world comparison between competitors. The relationship to real life is debatable, but the rules by which they operate are carefully designed to inject some aspects of realism to comparative claims.

Benchmarks are intended to be a repeatable test of a set of skills. But, as with all competitive sports, sometimes there's the kid who starts his benchmarking career hanging round on street corners stoning passers-by for entertainment, or hoisting up the nearest large rock at hand and heaving it through the closest plate-glass window.

And the new kid is over at HP, using a non-benchmark as a benchmark, and generally lobbing rocks around in all directions. I'm not going to dissect the post in detail, because others have and will continue to do that. There's one little paragraph I want to focus on, because it demonstrates one of my pet peeves; benchmark intuition.

Now things were starting to make sense.   We were seeing the same sort of decay curve as shown in the IOMeter results posted in Making Sense of WAFL - Part 4.    Every time the test is run, the random component of the Jetstress database accesses fragment the LUN further and the throughput numbers get worse.  An array like EMC CX or HP EVA wont undergo this sort of decay curve since these arrays do not have internal WAFL-fragmentation problems like the FAS does.

The non-benchmark is Microsoft's ESRP, and the tester's intuitive assumption is that WAFL fragments; hence the tester's intuitive assumption that this is the source of the diminishing throughput numbers.

Bzzzt. Big fail.

Let's allow the first intuition; let's allow, for the sake of this demonstration, that "WAFL fragments your data". Here's a simple example to demonstrate why his intuition is wrong on fragmentation being the source of the problem. Exchange 2007 generates small random IOs (and that's the JetStress that HP are using in their test). The table below has 5 columns to demonstrate why small random IO works just as well (or badly, depending on your take) on randomly laid out data as sequentially laid out data.

  • Random Placement: I've placed 100 blocks randomly. The numbers have been generated from www.random.org. Slot 1 is block 67, slot 2 is block 19 and so on.
  • Random Requested Block: this is meant to simulate IO requests from Exchange; again, drawn from a different run from www.random.org.
  • Matching Block (Random Placement): this is where the requested block actually lives. So asking for block 80 requires a visit to slot 22, and so on.
  • Seek Distance (Random Placement): this is the effective seek distance between requested random blocks. After we visit slot 22 (for block 80), we need to visit slot 58 (for block 37), requiring a seek of 36 slots.
  • Seek Distance (Sequential Placement): this is the effective seek distance between the same requested blocks when the data is laid out sequentially. After we visit slot 80 (for block 80), we need to visit slot 37 (for block 37), requiring a seek of 43 slots.

(I'm having difficulty uploading the spreadsheet to TypePad, so when I get it fixed you'll be able to "Click to download the whole spreadsheet". Not yet though.)

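| Random Placement | Random Requested Block | Matching Block (Random Placement) | Seek Distance (Random Placement) | Seek Distance (Sequential Placement) |
|---|---|---|---|---|
| 67 | 80 | 22 | 0 | 0 |
| 19 | 37 | 58 | 36 | 43 |
| 75 | 18 | 61 | 3 | 19 |
| 23 | 26 | 53 | 8 | 8 |
| 85 | 57 | 63 | 10 | 31 |
| 59 | 100 | 14 | 49 | 43 |
| 14 | 59 | 6 | 8 | 41 |
| ... | ... | ... | ... | ... |
| SUM | | | 3269 | 3322 |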
Hey, look at that! The sequentially laid out data takes more slot seeks than the randomly laid out data! Try it yourself: replace the 100 numbers in the first two columns with random numbers, and check the seek distance sum.

On average, they will be equal. In fact, if the requested blocks are random, it doesn't matter how the data is laid out. Intuition fail.
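If you'd rather not fiddle with a spreadsheet, here's a rough sketch of the same experiment in Python. It is not the original spreadsheet: the function names, the number of trials and the use of Python's own random number generator (rather than www.random.org) are all my assumptions. It repeats the 100-slot exercise many times and averages the seek totals for both layouts.

```python
import random


def total_seek(requests, slot_of):
    """Sum of absolute slot distances between consecutive requests."""
    slots = [slot_of[block] for block in requests]
    return sum(abs(b - a) for a, b in zip(slots, slots[1:]))


def simulate(n_blocks=100, n_requests=100, trials=2000, seed=1):
    """Average seek total per run for random and for sequential placement."""
    rng = random.Random(seed)
    blocks = list(range(1, n_blocks + 1))
    sequential = {b: b for b in blocks}  # block b lives in slot b
    random_total = sequential_total = 0
    for _ in range(trials):
        shuffled = blocks[:]
        rng.shuffle(shuffled)
        # slot 1 holds shuffled[0], slot 2 holds shuffled[1], and so on
        random_layout = {block: slot for slot, block in enumerate(shuffled, start=1)}
        # the same random request stream is played against both layouts
        requests = [rng.choice(blocks) for _ in range(n_requests)]
        random_total += total_seek(requests, random_layout)
        sequential_total += total_seek(requests, sequential)
    return random_total / trials, sequential_total / trials


if __name__ == "__main__":
    rand_avg, seq_avg = simulate()
    print(f"random placement:     {rand_avg:.0f} slots per run")
    print(f"sequential placement: {seq_avg:.0f} slots per run")
```

For two independent uniform draws from 1..n the expected distance is (n² − 1)/(3n), about 33.3 slots when n = 100, so 99 seeks should come to roughly 3300 whichever way the data is laid out. Both sums in the table sit close to that.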

Here's the professional sport of benchmarking, which HP don't take part in (still being, as it were, at the rock-throwing stage);

And, of course, the official ESRP results (and, just as a reminder, these aren't benchmarks)

Having failed at shot-putting, perhaps HP might want to pick another sport for their talented testers. Like nude football.

[updated to correct some borked links and a typo].

September 02, 2009

NetApp's $1Million Essay

I don't get paid by the word for what I write, and you probably don't either. Here's a rewarding way to change all that; pick up your pen and in 500 words or so you could be getting the equivalent of $2000 a word.

Yes, it's back; NetApp's $1 million virtualization challenge. (You might remember the original challenge, and what we saw as a need to call out the exceptional advantages of NetApp's guarantee.)

And we're not looking for a Shakespeare.

In no fewer than 500 words, describe your current storage infrastructure and how it supports your VMware environment. Please include as many details as possible to differentiate your submission, such as: number of servers and VMs deployed, type of current storage connected to your virtual servers, your desired business metrics, ROI and timeframe for a new storage deployment in your virtualized environment.

My colleague Vaughn Stewart has made a couple of observations on his latest blog about the program, which is based on our virtualization guarantee. Vaughn makes a point about RAID-10 vs RAID-DP in the guarantee;

As many of you may know, this program has never been without its share of criticism by other storage vendors. In my opinion, I believe this criticism has been fair as a component of the guarantee program compares RAID-10 and RAID-DP on the merits of storage efficiency. While both technologies are near equals in terms of performance and data protection it is unfair to compare the 50% utilization of RAID-10 to the 87.5% provided by RAID-DP

I beg to differ (and did so in my original blog). I believe it's valid to compare the two, because the alternative -- RAID-5 -- just isn't up to the job of protecting your data.

I said back then; so why not compare to RAID-6?  The truth is, we could and would, but we still don’t find our competition selling it. It’s there on the spec sheet as a solution from some vendors – not all have the capability -- but it’s rarely put forward as a viable solution.

And that's the extent of our disagreement. The rest of his blog is spot on; and with public disclosure of the results, I too think we'll see high rates of savings without including RAID-DP.

Get scribbling!


August 31, 2009

Du Kommer Att Tala Svenska På VMWorld!

The frequency with which letters are used in the English language has long been known by typographers, who, in the days of mechanical typesetting, kept more Es and Ts to hand than Qs and Ks. Hence ETAOIN SHRDLU; the twelve commonest letters in English, ranked by use, which together covered 80% of the letters used. Letters like Q, Z and V appeared very infrequently indeed.

Until now. Welcome to the changing world of English, where a quick analysis of some text on www.vmworld2009.com suggests that our new lettery overlord is

V is normally used 0.98% of the time, but VMware has single-handedly pushed it way up the scale; there are at least 4 Vs on every page. Some pages have a V in every sentence.

That's more Vs than are used in Swedish.

I'm in San Francisco for VMworld this week, and I hope I'll meet you there. Where, like you, I will try not to sound like I'm speaking Swedish.

Vi ses på VMworld!


August 24, 2009

I'm Sending Chuck a Bluey

My youngest daughter is going out with a professional soldier. That's tough for her; soldiers in the UK Armed Forces are often on active duty, and he's served several long tours of duty in Afghanistan already, something that she finds difficult to handle emotionally when he's away.

During his last tour, she wanted to send him a bluey. It's the Services equivalent of a telegram; you write your bluey online, it gets sent and printed locally, and most if not all blueys are delivered in 24 hours, even to servicemen right on the front line. It's good for morale, and the service is excellent.

Her first look at the website resulted in tears. When she looked at the information to get the letter to him, she discovered that all she knew was his name, his rank, and that he was in Afghanistan. No other information, and although she thought she knew his regiment, she wasn't sure.

Dad to the rescue. The boyfriend's name is unusual. Very unusual indeed, with a first name from Greek antiquity, and a surname that appears four times in the local telephone directory -- and they're all family members. This guy has a name you'd never forget; a name like Roman Rock (not his real name, because he's still on active service). If there's another person in the UK with the same name, far less in the Army, I'll eat my shorts. So I calmed her down, and helped her address her bluey, to Roman Rock. That was all the information we could fill in.

It got to him in less than 12 hours. We didn't know where he was located, but the Army did, because he had a unique name.

Chuck Hollis had an interesting blog on filling out blueys. Well, sort of; on what he sees as the death of the filesystem. The topic struck me as interesting; The Future Doesn't Have A File System. As usual, but with good reason, I disagree.

Object-based information stores are different than filesystems in several important ways. 

First, you use a token or other uniform identifier to get your information.  File systems imply location, tokens don't -- no such thing as a broken link or a moved file system.  Not to mention, tokens can uniquely identify gazillions of information objects.

Disagree; filenames don't imply location. They're just names; metadata that allows you to uniquely identify the data you wish to access. This is a problem that the internet created, and that the internet solved; see later.  

Second, they have the ability to associate all sorts of metadata with the object itself.  As the information object goes, its metadata travels with it.  A very useful property indeed.

And that differs how? You can associate metadata with a filename -- if you can do it for an "object", you can do it with a file. (Even old-fashioned filesystems keep things like last access time, size, and the name itself is often used to give further, human readable clarification, such as .html or .doc; although it may lie and we may care not to use it for such, that doesn't change the point of associating metadata with the name-as-a-handle.)

Third, the ability to hang metadata off the object gives us the ability to create all sorts of useful policies and services around the information without having to put everything in some sort of database or repository.

Double eh? If this (it's a GUID, or globally unique identifier);

c2f41010-65b3-11d1-a29f-00aa00c14882

doesn't require a repository or a database for its metadata, I'll eat my shorts. Again.

The internet solved this problem a good while ago. See RFC1737, RFC2141 and RFC3986. The essence of the RFCs; part is a name (like Roman Rock) and part is a location (like Afghanistan). The part that is the name doesn't tell you where the file is located; but the part that describes the location can be completely absent.

An example is in order. To demonstrate how far things have changed since Chuck banged away with Unix pipes and vi in the 1900s, this returns a file;

http://blogs.netapp.com/shadeofblue/2009/08/poetry-corner.html

Interestingly, the file doesn't exist until you ask for it, because it's dynamically generated from parts. There's no directory shadeofblue, 2009, or 08, and no file poetry-corner.html either. And (here's a clue how far this goes) it doesn't live on a server at netapp.com either. It's all name and no location, and it works across the entire internet, not just inside a single object store.
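To make the name-versus-location split concrete, here's a small sketch using Python's standard urllib.parse module. The describe function and its dictionary keys are mine, purely for illustration, and wrapping the GUID from earlier in a urn:uuid: prefix is my assumption about how you'd write it as a URN. Treating the whole path as "the name" is a simplification, too.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Split a URI into its scheme, its (optional) location and its name."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "location": parts.netloc or None,  # the authority part; may be absent
        "name": parts.path,
    }


for uri in (
    "http://blogs.netapp.com/shadeofblue/2009/08/poetry-corner.html",
    "urn:uuid:c2f41010-65b3-11d1-a29f-00aa00c14882",  # hypothetical URN form of the GUID
):
    print(describe(uri))
```

The URL carries a location part, the URN carries none, and both are perfectly good names. Neither says anything about directories on a disk.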

Filesystems aren't the problem here; it's an attempt to make Atmos relevant. I think I'll send a bluey to Chuck and let him know; The Future Doesn't Have an Atmos.

 


I'm off to VMworld next week in San Francisco. It's my first time at a VMworld conference, and I'm not quite sure what to expect. But I do hope to meet lots of interesting people I've never met before while I'm out there. If you recognize me (yes, I look like the photograph), please introduce yourself. And no, I don't bite!

August 04, 2009

Poetry Corner

On the subject of running the NetApp simulator (the "sim" referred to below) an esteemed colleague of mine (Richard Barlow) suggested this as a possible workaround to the issues a NetApp systems engineer was facing in choosing a system to run it on.

As a workaround you could run a free copy of VMware server in Hyper-V and boot the sim in that. It mIght be a bit slow but it would work. VMware server can be virtualized pretty easily.     

Wow, I thought, is he serious? So I penned a light hearted reply in the form of a poem, modeled after the mirror in the Brothers Grimm's fairy tale Snow White;

Mirror, mirror on the wall,
Is that VM here at all?
Or is the thing I think I see
Running XEN on Hyper-V?

Mirror, mirror on the wall
Will I get work done at all?
And are the cycles shown as free
Really there on my PC?

Now I want to add a SAN
But how to run it on a LAN
That may be there, or maybe not;
Mirror, mirror, just what is what?

To which the mirror replied;

O Virtual SAN, though fair ye be
A real SAN’s fairer far to see.

Think you're a budding William McGonagall? All poetry submissions on the subject of technology welcome.


July 17, 2009

IBRIX Falls to HP

Today, HP announced that it was going to buy IBRIX, a small vendor of scale-out NAS software. HP have partnered with them for some time, so the move was probably a natural one.

A quick analysis; here are my initial thoughts.

  • PolyServe (aka Enterprise File Services) is dead. Dead as the Norwegian Blue. It always looked a bit like it was nailed to its perch, but this, if any was needed, is confirmation that doing scale-out well is hard, and that HP picked a loser the first time round.
  • This lets HP sell more servers; IBRIX is software only, and what better than another acquisition that lets them close out IBM and Dell in favour of HP tin?
  • Cloud thinking must be in there somewhere. A scale-out NAS is always going to be preferable over a bunch of strung-together aging EVAs.
  • What is confusing is the HP StorageWorks product range. It encompasses everything from DAS to NAS to SAN in multiple flavours; a veritable Tutti Frutti of storage. But then, it's all "designed" to sell servers from what I can see. Either it runs on an HP server, or it needs an HP server to front it off.
  • Lastly, and I think you'd have to agree on this one; it's an interesting time to be in storage.

Where and what (and who) next? Your guess on this one is as good as mine.


July 12, 2009

Free Cheese from EMC

EMC's Barry Burke seems mighty pleased to announce that Virtual Provisioning (what everyone else calls thin provisioning) is now "free" (at least on the V-Max and DMX4 storage arrays but not on the CLARiiON).

I'm having a remarkably similar argument with HP folks on their assertion that the software on their LeftHand storage arrays is "free", even though there's a big difference in the price of similarly specified servers and the P4000 based on the same tin. The same with Sun; their "free" software on the Sun 7210 array costs a lot. The system is based on a Sun X4540, which is much cheaper.

And Barry's "We'll see if others follow suit" challenge is just so, well, yesterday's news. There are any number of storage vendors that have thin provisioning at no additional cost, including NetApp FAS systems; since 2004 to be precise.

What Barry means by free is at no additional cost, and it's an important difference. And I don't think buyers are fooled by this use of the word. It's the total cost that you need to look at, including the free. He eventually gets there; right in the last-but-one paragraph where it's "all at no extra charge".

Free? There's always free cheese in mouse traps, but the mice there aren't best pleased.


July 07, 2009

LeftHand Quorums and Split Brains

There's been a bit of a discussion going on between John Spiers (HP LeftHand) and myself over a number of issues I raised about a LeftHand SAN's storage efficiency. The comments moved on to talk about HA (high availability) and data reliability, and John raised a number of questions that I thought deserved a longer answer.

John Martin of NetApp (who also blogs on NetApp's Storage Efficiency blog) has kindly provided me with more detail, and rather than post this as a comment, I thought it worthy (again) of a blog in its own right. Thanks to John for the responses.

John Spiers' points are in blue (and I think I've accurately captured them, but they've been lightly [edited] for context). They're also slightly out of order.

In summary, I think there's a need for greater clarity on LeftHand best practices. Everyone, including the user quoted below, appears to be operating in the dark. As I said in a previous post, we've had to do a little "reading between the lines" and work from first principles. If any of this is wrong, please let me know, and I'll correct it.

[Update 09July: Many LeftHand manuals, including best practices, appeared in late June/July, but weren't there when I first looked in late May/early June. Much reading to do!]


NetApp can't deliver this level of HA with auto failover and failback [compared to a LeftHand SAN]

It depends on what you mean by "auto failback". If you mean failback initiated without permission or authorization from a responsible human being, then you'd be correct. If you mean an automated process with minimal user intervention, then you're wrong. From a NetApp perspective, and that of most storage and systems professionals I've talked to when discussing failback requirements, automated and uncontrolled failback is usually judged to be a bad idea.

NetApp also have an excellent product in addition to MetroCluster, called MultiStore. This provides similar kinds of functionality (automated failover and failback) over standard IP connections.

Can SnapMirror or MetroCluster [...] incrementally rebuild the primary site, while maintaining application state and data integrity – i.e. RPO=0 and 100% uptime? I didn’t think so.

Yes it can. As soon as the administrator believes it to be safe, the "failback" process is initiated, and an automatic incremental rebuild and resync provides seamless failback.

[quoting from a NetApp document] "Mirrored active/active configurations do not provide the capability to fail over to the partner node if one node is completely lost. For example, if power is lost to one entire node, including its storage, you cannot fail over to the other node. For this capability, use a MetroCluster"

[quotes sections about manual cluster failover and prevention of “split brain” and JS then says] LeftHand has distributed quorum management that eliminates all possibilities of a “split brain”. This allows at least one site to operate and then automatically resync the other sites when they come back online.

This is an interesting quote, which when taken out of its correct context (a practice commonly called "quote mining") makes it sound much worse than when it's put in context.

  1. It should be noted that this is for a VERY unusual configuration, one that I’ve never seen go into production at any site. The comment is under the heading “Mirrored Active/Active Configurations”. With this, NetApp's SyncMirror is added to a standard active/active configuration to provide a second level of mirroring across the disks, but without a MetroCluster license for local (or stretch) functionality (suitable for distances of around 500m).
  2. Should a customer decide that their data requires the extra protection of SyncMirror (i.e. RAID 6+1), then we would recommend the extra resilience provided for in this configuration by using MetroCluster.

The reason for this behavior is to avoid a “split brain” scenario. There is no way of automatically detecting the difference between a total failure of a system and the failure of all forms of communication between them. It's this that causes "split brain"; two or more running systems thinking the other has failed, when only the communication between them has failed.

This applies to every conceivable cluster configuration. That includes quorum disks, proxy nodes and other cluster node failure detection techniques; they are all effectively forms of communication between nodes.

Distributed quorum management requires at least three nodes. It is not possible to have a quorum based on two nodes.
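A minimal sketch of the arithmetic, assuming the usual strict-majority rule for quorum. That rule is the textbook definition; I'm not claiming it's how SAN/iQ implements its quorum internally, and has_quorum is a name I've made up for the example.

```python
def has_quorum(total_nodes, reachable_nodes):
    """A partition may carry on only if it can see a strict majority of all nodes."""
    return reachable_nodes > total_nodes // 2


# Two nodes with the link cut: each side sees only itself, so neither has a majority.
print("2 nodes, split 1/1:", has_quorum(2, 1), has_quorum(2, 1))

# Three nodes split 2/1: exactly one side keeps quorum, so there is no split brain.
print("3 nodes, split 2/1:", has_quorum(3, 2), has_quorum(3, 1))

# Four nodes split evenly are no better off than two.
print("4 nodes, split 2/2:", has_quorum(4, 2), has_quorum(4, 2))
```

With two nodes, losing the link (or the other node) always leaves the survivor without a majority, which is why a third vote has to come from somewhere.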

I’m not letting you brush this one under the rug. MetroCluster and SyncMirror don’t provide the same level of availability that is inherent in LeftHand’s base SAN/iQ software offering, and LeftHand requires no additional equipment.

That's interesting, because here's a LeftHand user experience;

"So at work we decided to go with a Lefthand Implementation of iSCSI, I am rather unhappy to find out that with only 2 units you have to run a virtual machine to complete the Quorum for redundancy. I am not happy to find out that in reality you need 3 appliances to complete a Quorum for management and to ensure that you have redundancy and that everything is available."

You should have been clearer about comparing a three or more node site vs. a two node site. This HA "no additional equipment solution" is growing legs.

And while I’ll give kudos for an n-way implementation, unlike the NetApp documentation you quote, you don’t address how you handle what happens when there is a complete loss of communication only from the primary site to all other sites. None of the other nodes are able to detect whether the site is a smoldering pile of rubble, or if it’s just incommunicado, and if you make the assumption that lack of communication is a trigger for failover, then you have a recipe for split brain.

The manual “declared disaster” approach is the safest way of dealing with this. If a customer believes they have a foolproof way of detecting a true site failure, then it's trivially easy to integrate this by automating MetroCluster failover (or failback for that matter).

Yes, [HP's LeftHand] Network RAID 2 can sustain any random 5 disk failure. In fact, a 4 array system configured in Network RAID level 2 can sustain up to 2 complete array failures, and up to 6 disk failures in each of the remaining 2 arrays, and all at the same time.

Hmmm... Picture courtesy of HP's LeftHand P4000 brochure 4AA2-5247ENW, April 2009.

[Diagram from the HP LeftHand P4000 brochure]

Let's start with nodes. I have two copies of my data for nRAID2, and four arrays. If any two nodes fail, then I lose data. Which two arrays can I blow away in this four node system, and not lose data in my logical volume? With this simplistic 4 block LUN, 1 and 3 or 2 and 4. Unless you can arrange your node failures in advance, I'd suggest that is one node protection.
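Here's that reasoning as a small enumeration. The layout is an assumption on my part: four blocks, each mirrored on two adjacent nodes, which is how I read the brochure diagram. The block names and function names are made up for the example.

```python
from itertools import combinations

# Hypothetical nRAID2 layout: each block of the LUN has two copies on adjacent nodes.
LAYOUT = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}, "D": {4, 1}}


def survives(failed_nodes, layout=LAYOUT):
    """True if every block still has at least one copy on a surviving node."""
    failed = set(failed_nodes)
    return all(copies - failed for copies in layout.values())


def survivable_pairs(nodes=(1, 2, 3, 4), layout=LAYOUT):
    """All two-node failures that leave the whole volume readable."""
    return [pair for pair in combinations(nodes, 2) if survives(pair, layout)]


print(survivable_pairs())
```

Under that layout only two of the six possible two-node failures leave the volume intact: nodes 1 and 3, or nodes 2 and 4. Any adjacent pair takes both copies of some block with it.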

And the statement that I can survive any random 5 disk failure is equally implausible given the LeftHand diagram above. Two disk failures in node 1 in the same RAID5 group would kill node 1. A further two disk failures in a RAID5 group kill node 2, and we now have two node failures which can't be survived. The answer for nRAID2 is 3 random disk failures, not 5.

I'm not even going to try and work out what "up to 6 disk failures in the remaining 2 arrays" really means, although I'll bet the words "random" and "quorum" don't feature heavily. 

Of course, this all assumes the HP LeftHand diagram above is correct, and not just a graphic designer's interpretation of the facts. The diagram below for the terrible usable capacity of a LeftHand system for 1 node failure protection (split brained or otherwise) is correct. I've had it double checked by HP.

LeftHand nRAID2 means 35% usable capacity


July 01, 2009

30 Years Ago; Cloud Deja VU

There is, as they like to say in Yorkshire, nowt new under the sun.

Here's an announcement from IBM that covers not only advanced cloud virtualization (much more advanced than we have today!) but also CDP (Continuous Data Protection) of VMs too. IBM's Virtual Universe Operating System.

The announcement date for OS/VU? Three decades ago, in 1979.

I remember this spoof rather well. In 1982, some helpful soul at IBM's Yorktown Heights sent me this to cheer me up. I was spending long evenings and nights on the phone with IBM, struggling to install APL and a planning application running under VM/CMS on a mainframe.

I got there eventually. I was probably one of the first of a handful of people at that time that managed to deliver a virtualized, containerised and cloud-like environment for end users.

I even provided the equivalent of vMotion. On tape. Ah, how little times change.

This month also marks my 30th year in the IT industry. I started out with no specific plans except to do what I enjoyed most; getting to play with -- ahem, work with computers, and getting paid for it. It's been a great three decades.

No cards or flowers, please.


NEW IBM OPERATING SYSTEM

Because so many users have asked for an operating system of even greater capability than VM, IBM announces the Virtual Universe Operating System -- OS/VU.

Running under OS/VU, the individual users appears to have not merely a machine of his own, but an entire universe of his own, in which he can set up and take down his own programs, data sets, systems networks, personnel, and planetary systems. He need only specify the universe he desires, and the OS/VU system generation program (IEHGOD) does the rest. This program will reside in SYS1.GODLIB. The minimum time for this function is 6 days of activity and 1 day of review. In conjunction with OS/VU, all system utilities have been replaced by one program (IEHPROPHET) which will reside in SYS1.MESSIAH. This program has no parms or control cards as it knows what you want to do when it is executed.

Naturally, the user must have attained a certain degree of sophistication in the data processing field if an efficient utilization of OS/VU is to be achieved. Frequent calls to non-resident galaxies, for instance, can lead to unexpected delays in the execution of a job. Although IBM, through its wholly owned subsidiary, The United States, is working on a program to upgrade the speed of light and thus reduce the overhead of extraterrestrial and metadimensional paging, users must be careful for the present to stay within the laws of physics. IBM must charge an additional fee for violations.

OS/VU will run on any x0xx equipped with Extended WARP Feature. Rental is twenty million dollars per cpu/nanosecond.

Users should be aware that IBM plans to migrate all existing systems and hardware to OS/VU as soon as our engineers effect one output that is (conceptually) error-free. This will give us a base to develop an even more powerful operating system, target date 2001, designated "Virtual Reality". OS/VR is planned to enable the user to migrate to totally unreal universes. To aid the user in identifying the difference between "Virtual Reality" and "Real Reality", a file containing a linear arrangement of multisensory total records of successive moments of now will be established. Its name will be SYS1.EST.

For more information, contact your IBM data processing representative.

Mainframe humour! Honestly, it is funny, but perhaps you had to be there to get the jokes.

[I'm on holiday this coming week. Normal service will be resumed shortly.]


June 26, 2009

An HP LeftHand Triplication Calculator

Part 1 of my investigation of LeftHand's claim to save money, given its less than stellar 35% usable from raw disk space, generated a number of interesting replies from HP. One is worthy of more analysis, but as it's a bit big for a comment, I've taken the liberty of extracting John Spiers' reply to me. John was the former CTO of LeftHand Networks prior to its acquisition by HP.

There's a lack of information on how LeftHand does its stuff publicly available, so I've had to do a little "reading between the lines" and work from first principles. If I've got any of this wrong, please let me know, and I'll correct it. I've added a running commentary to John's comment; my apologies for breaking it up, but it's all here (in blue for clarity).

When using Network RAID 2 it protects you from multiple disk faults, complete array faults and site faults with auto failover and failback. NetApp can’t deliver this level of HA with auto failover and failback. Features like MetroCLuster give you data protection, but not HA, and at a lower capacity utilization than LeftHand. Can SnapMirror or MetroCluster automatically fail back, incrementally rebuild the primary site, while maintaining application state and data integrity – i.e. RPO=0 and 100% uptime? I didn’t think so.

Then you'd be wrong; that's exactly what MetroCluster is about. Except the auto failback; that's just adding a disaster on top of a disaster. But I digress, and that's the subject for another post. 

It's worth pointing out before I analyse this claim that I originally thought that LeftHand had come up with a new paradigm with its network RAID; that it provided both data protection and high availability built on commodity tin. I was wrong; if it was that easy, we'd have done it. But NetApp's 15 years' experience in doing this stuff has taught us otherwise.

First, I need to explain what a NetApp cluster is all about; then we can compare and contrast, and ask some questions.

NetApp Data Protection and HA

NetApp uses NVRAM (non-volatile RAM) and transaction logging to capture writes to disk. All writes are acknowledged before they get to disk, but only after they've been logged. This means that when a single controller (non-HA) fails, the data that we told the server or client we had written, but which was still held in NVRAM, is replayed to the disks when the system comes back up. That way, we guarantee what we promised when we said we'd written the data. The data is both consistent and durable.

In a cluster or HA solution, we ensure cache coherency; the contents of the NVRAM on controller 1 are mirrored to controller 2. That way, when controller 1 goes down, controller 2 can replay controller 1's writes, and take over its workload, as the disks are addressable from both controllers. Again, the data is both consistent and durable; and we've made it HA, without downtime to the application. It carries on running.

If controller 2 now fails (or both fail together), we're still consistent and durable; see above for the single controller case.
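The log-then-acknowledge idea, and the mirroring of the log to a partner, can be sketched in a few lines. This is a toy model under loud assumptions: a Python list stands in for NVRAM, a shared dict stands in for the disk shelves, and the class and method names are mine. It is not how Data ONTAP is written.

```python
class Controller:
    """Toy controller: a list stands in for NVRAM, a shared dict for the disks."""

    def __init__(self, disks):
        self.disks = disks      # shared storage, addressable from both controllers
        self.nvram = []         # own writes: logged, but not yet on disk
        self.partner_log = []   # mirror of the partner's NVRAM contents
        self.partner = None

    def write(self, block, data):
        self.nvram.append((block, data))                    # log first...
        if self.partner is not None:
            self.partner.partner_log.append((block, data))  # ...mirror to the partner...
        return "ack"                                        # ...and only then acknowledge

    def replay(self, log):
        """Apply logged writes to disk, then empty the log."""
        for block, data in log:
            self.disks[block] = data
        log.clear()

    def recover(self):
        """Single-controller case: replay our own log after coming back up."""
        self.replay(self.nvram)

    def take_over(self):
        """HA case: replay the failed partner's mirrored log."""
        self.replay(self.partner_log)


disks = {}
a, b = Controller(disks), Controller(disks)
a.partner, b.partner = b, a
a.write(7, "payload")   # acknowledged to the client, not yet on disk
b.take_over()           # controller a has failed; b replays a's writes
print(disks)
```

Whichever controller replays the log, the acknowledged write ends up on disk, and that's the promise being kept.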

Lastly, SATA drives in particular can suffer from "lost writes". Every drive has a cache where it stores data to be written. This is separate from any other protected cache, for instance NetApp's NVRAM.

As soon as an IO hits this buffer, the drive acknowledges the write. But blocks can subsequently be written in the wrong place, or not written at all, especially if there's a disk failure between acknowledgement and the physical write. 

Because NetApp has the ability to control both the RAID and the file system, Data ONTAP 7G provides the unique ability to catch errors such as this and recover. Along with a block checksum, ONTAP also stores WAFL metadata (the inode # of a file containing the block) that provides the ability to verify the validity of a block being read. If the block being read does not match what WAFL expects, the data gets reconstructed, ensuring that your data is both consistent and durable.
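The read-time check can be sketched in a few lines of Python. This is a simplified illustration of the idea only: the record layout, the function names, the use of CRC32 as the checksum, and the `reconstruct` callback (standing in for a rebuild from RAID parity) are all my assumptions, not WAFL's on-disk format.

```python
import zlib

# Illustrative only: layout and names are hypothetical, not WAFL's format.

def make_record(data, inode):
    """What is stored with a block: the data, a checksum, the owning inode."""
    return {"data": data, "checksum": zlib.crc32(data), "inode": inode}

def read_block(disk, block_no, expected_inode, reconstruct):
    """Return verified data, rebuilding the block if it fails either check."""
    rec = disk.get(block_no)
    ok = (
        rec is not None
        and zlib.crc32(rec["data"]) == rec["checksum"]  # catches corruption
        and rec["inode"] == expected_inode              # catches stale or misplaced blocks
    )
    if not ok:
        data = reconstruct(block_no)  # e.g. rebuild from RAID parity
        disk[block_no] = make_record(data, expected_inode)
        return data
    return rec["data"]

# A lost write: the drive acknowledged new data for inode 42, but the
# block still holds old, internally valid data belonging to inode 17.
disk = {5: make_record(b"old", 17)}
recovered = read_block(disk, 5, 42, lambda n: b"new")
```

The point of the second check is that a checksum alone isn't enough here: the stale block's checksum is perfectly valid, so only the identity mismatch gives the lost write away.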

NetApp goes to extraordinary lengths to protect your data.

Here's the issue. I can't see this level of protection in a LeftHand SAN.

LeftHand Data Protection and HA

LeftHand systems are built from commodity servers and use a battery-backed RAID controller to provide a log of writes to disk. This means that when a single node (non-HA) fails, the data in the cache is replayed to the disks when the system comes back up.

But what about lost writes? Do LeftHand SANs provide protection against failure to write data correctly from the disk's cache?

In a cluster using nRAID2, the IO is copied to a second node, and the same scenario as in the single-node case plays out on each. Effectively, cache coherency is provided by mirroring the data to a second (or third, or fourth) node across the network, which is slower and adds to latency.

But what about lost writes? Do LeftHand SANs provide protection against failure to write data correctly from the disk's cache?

With nRAID2, when a node fails, you want to guarantee your writes because the other node(s) are now holding a single copy of your data. That nRAID2 is now no more than RAID5 striped across one or more nodes, writing data through disk cache.

But what about lost writes? Do LeftHand SANs provide protection against failure to write data correctly from the disk's cache?

The choices to protect your data?

  • Turn on write-through (turn off the disk cache and force IO straight to disk). On all disks on all nodes.
  • Choose nRAID3 so you have a second mirror.

The first option causes huge IO performance problems; drives that are forced to write directly perform very badly indeed.

The second option of nRAID3 is the only alternative.

And every read is still fraught with danger. You may read a block from node A but get completely different data from node B for the same block request -- because there's no guarantee of protection against lost writes on any of the nodes.

The LeftHand Triplication Calculator

Ok, let’s talk capacity. All NetApp’s customers know NetApp’s storage utilization is below 50% when using best practices.

NetApp's best practices are here. See page 20, section 7.4 Best Practice Configurations. This is the same old HP tap dancing.

But instead of re-hashing what everyone already knows, let’s do a simple calculation for a highly available multi-site SAN using MetroCluster (as a side note, you know Calvin is taking it easy on you with the MetroCluster pricing.)

Let’s say a customer has 10TB of NetApp raw storage at the primary site and they replicate that 10TB to a remote site for HA and disaster protection. Storage utilization is now 50% (10TB/20TB.) Take your 63% at both sites, and we won’t bother to include things like the space taken up for NetApp’s root volume and replication log files. 63% of 10TB leaves you 6.3TB of usable capacity replicated. This means you can create 6.3TB of data out of 20TB raw. That’s 31.5%.
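The arithmetic in that quoted example can be reproduced with a short Python sketch. The function name and parameters are mine, purely for illustration; the 10TB, two sites and 63% figures come straight from the quote.

```python
def usable_fraction(raw_per_site_tb, sites, efficiency):
    """Usable capacity as a fraction of total raw capacity across all
    sites, when every site holds a full replica of the data."""
    usable_tb = raw_per_site_tb * efficiency  # one copy of the data
    total_raw_tb = raw_per_site_tb * sites    # raw disk bought overall
    return usable_tb / total_raw_tb

# 10TB raw per site, 2 sites, 63% usable per site
print(f"{usable_fraction(10, 2, 0.63):.1%}")  # 31.5%
```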

With LeftHand’s Network RAID level 2 you can split your SAN across 2 sites for a better HA solution and the customer’s utilization, according to you, is better than NetApp’s – 35%.

Except it isn't anywhere near MetroCluster in terms of HA and data protection -- in fact it's nowhere near a single controller NetApp system in terms of data protection.

LeftHand is now down to 24% usable for an inferior solution.

[Image: 24% usable]

The Rest

Now it’s time for some education:

Network RAID is set at the volume level. Not all volumes require the advance data protection level of Network RAID level 2, therefore utilization is typically much better if used at a single site.

At which point, you now have a RAID5 solution with no lost write protection (or you turn on write-through on every disk and suffer huge IO penalties). Really, there's no point. Might as well buy a cheap Linux server and be done with it.

If you really want to get schooled let’s talk about dual-parity based network RAID and what that does for utilization. What we should really be talking about is cost and capacity utilization of NetApp GX vs. LeftHand, because that is the only product that comes close to LeftHand’s architecture. Does GX support block yet?

No, let's focus first on my claim that LHN means more space, more power, more cooling, more cost per usable TB for an inferior solution. Address that, and then you get bragging rights.
