There's been a bit of a discussion going on between John Spiers (HP LeftHand) and myself over a number of issues I raised about a LeftHand SAN's storage efficiency. The comments moved on to talk about HA (high availability) and data reliability, and John raised a number of questions that I thought deserved a longer answer.
John Martin of NetApp (who also blogs on NetApp's Storage Efficiency blog) has kindly provided me with more detail, and rather than post this as a comment, I thought it worthy (again) of a blog in its own right. Thanks to John for the responses.
John Spiers' points are in blue (and I think I've accurately captured them, but they've been lightly [edited] for context). They're also slightly out of order.
In summary, I think there's a need for greater clarity on LeftHand best practices. Everyone, including the user quoted below, appears to be operating in the dark. As I said in a previous post, we've had to do a little "reading between the lines" and work from first principles. If any of this is wrong, please let me know, and I'll correct it.
[Update 09July: Many LeftHand manuals, including best practices, appeared in late June/July, but weren't there when I first looked in late May/early June. Much reading to do!]
NetApp can't deliver this level of HA with auto failover and failback [compared to a LeftHand SAN]
It depends on what you mean by "auto failback". If you mean failback initiated without permission or authorization from a responsible human being, then you'd be correct. If you mean an automated process with minimal user intervention, then you're wrong. From a NetApp perspective, and that of most storage and systems professionals I've talked to when discussing failback requirements, automated and uncontrolled failback is usually judged to be a bad idea.
NetApp also have an excellent product in addition to MetroCluster, called MultiStore. This provides similar kinds of functionality (automated failover and failback) over standard IP connections.
Can SnapMirror or MetroCluster [...] incrementally rebuild the primary site, while maintaining application state and data integrity – i.e. RPO=0 and 100% uptime? I didn’t think so.
Yes it can. As soon as the administrator believes it to be safe, the "failback" process is initiated, and an automatic incremental rebuild and resync provides seamless failback.
[quoting from a NetApp document] "Mirrored active/active configurations do not provide the capability to fail over to the partner node if one node is completely lost. For example, if power is lost to one entire node, including its storage, you cannot fail over to the other node. For this capability, use a MetroCluster"
[quotes sections about manual cluster failover and prevention of “split brain” and JS then says] LeftHand has distributed quorum management that eliminates all possibilities of a “split brain”. This allows at least one site to operate and then automatically resync the other sites when they come back online.
This is an interesting quote, which when taken out of its correct context (a practice commonly called "quote mining") makes it sound much worse than when it's put in context.
- It should be noted that this is for a VERY unusual configuration, one that I’ve never seen go into production at any site. The comment is under the heading “Mirrored Active/Active Configurations”. In this configuration, NetApp's SyncMirror is added to a standard active/active configuration to provide a second level of mirroring across the disks, but without a MetroCluster license for local (or stretch) functionality (suitable for distances of around 500m).
- Should a customer decide that their data requires the extra protection of SyncMirror (i.e. RAID 6+1), then we would recommend using MetroCluster to provide the extra resilience for this configuration.
The reason for this behavior is to avoid a “split brain” scenario. There is no way of automatically detecting the difference between a total failure of a system and the failure of all forms of communication between the systems. It's this that causes "split brain": two or more running systems each thinking the other has failed, when only the communication between them has failed.
This applies to every conceivable cluster configuration. That includes quorum disks, proxy nodes and other cluster node failure detection techniques; they are all effectively forms of communication between nodes.
Distributed quorum management requires at least three nodes. It is not possible to have a quorum based on two nodes.
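The arithmetic behind that last point can be sketched in a few lines. This is a minimal illustration of the generic majority-vote rule, assuming one vote per node; it is not SAN/iQ's or Data ONTAP's actual implementation, and the function names are made up for the example.

```python
def majority(total_nodes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """True if a partition that can see reachable_nodes votes may keep running."""
    return reachable_nodes >= majority(total_nodes)


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes can be lost while the survivors still hold a majority."""
    return total_nodes - majority(total_nodes)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nodes: majority={majority(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

With two nodes the majority is two, so neither half of a 1/1 partition can claim quorum and no node loss is tolerated. A third vote is what makes a single failure survivable.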
I’m not letting you brush this one under the rug. MetroCluster and SyncMirror don’t provide the same level of availability that is inherent in LeftHand’s base SAN/iQ software offering, and LeftHand requires no additional equipment.
That's interesting, because here's a LeftHand user experience:
"So at work we decided to go with a Lefthand Implementation of iSCSI, I am rather unhappy to find out that with only 2 units you have to run a virtual machine to complete the Quorum for redundancy. I am not happy to find out that in reality you need 3 appliances to complete a Quorum for management and to ensure that you have redundancy and that everything is available."
You should have been clearer about comparing a three or more node site vs. a two node site. This HA "no additional equipment solution" is growing legs.
And while I’ll give kudos for an n-way implementation, unlike the NetApp documentation you quote, you don’t address how you handle what happens when there is a complete loss of communication only from the primary site to all other sites. None of the other nodes are able to detect whether the site is a smoldering pile of rubble, or if it’s just incommunicado, and if you make the assumption that lack of communication is a trigger for failover, then you have a recipe for split brain.
The manual “declared disaster” approach is the safest way of dealing with this. If a customer believes they have a foolproof way of detecting a true site failure, then it's trivially easy to integrate this by automating MetroCluster failover (or failback, for that matter).
Yes, [HP's LeftHand] Network RAID 2 can sustain any random 5 disk failure. In fact, a 4 array system configured in Network RAID level 2 can sustain up to 2 complete array failures, and up to 6 disk failures in each of the remaining 2 arrays, and all at the same time.
Hmmm... Picture courtesy of HP's LeftHand P4000 brochure 4AA2-5247ENW, April 2009.
Let's start with nodes. I have two copies of my data for nRAID2, and four arrays. If any two nodes fail, then I lose data. Which two arrays can I blow away in this four node system, and not lose data in my logical volume? With this simplistic 4 block LUN, 1 and 3 or 2 and 4. Unless you can arrange your node failures in advance, I'd suggest that is one node protection.
And the statement that I can survive any random 5 disk failure is equally implausible given the LeftHand diagram above. Two disk failures in node 1 in the same RAID5 group would kill node 1. A further two disk failures in a RAID5 group kill node 2, and we now have two node failures which can't be survived. The answer for nRAID2 is 3 random disk failures, not 5.
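To make the counting explicit, here is a small sketch that enumerates the failure cases. The block layout is my assumption, inferred from the statement that only nodes 1 and 3, or 2 and 4, can fail together: four blocks, two copies each, every block mirrored on a pair of neighbouring nodes. The "two disks kill a node" figure comes from the RAID5 argument above. None of this is taken from HP documentation.

```python
from itertools import combinations

# Assumed layout: each block has two copies on neighbouring nodes.
REPLICAS = {
    "A": {1, 2},
    "B": {2, 3},
    "C": {3, 4},
    "D": {4, 1},
}
NODES = (1, 2, 3, 4)
DISKS_TO_KILL_NODE = 2  # two failures in one RAID5 group take the node down


def survives(failed_nodes):
    """True if every block still has at least one copy on a surviving node."""
    failed = set(failed_nodes)
    return all(copies - failed for copies in REPLICAS.values())


def safe_node_pairs():
    """Pairs of nodes that can fail together without losing any block."""
    return [pair for pair in combinations(NODES, 2) if survives(pair)]


def min_fatal_nodes():
    """Fewest node failures that can lose data in the worst case."""
    for count in range(1, len(NODES) + 1):
        if any(not survives(c) for c in combinations(NODES, count)):
            return count
    return None


def guaranteed_disk_failures():
    """Largest number of arbitrary disk failures that can never lose data."""
    return min_fatal_nodes() * DISKS_TO_KILL_NODE - 1


if __name__ == "__main__":
    print("safe node pairs:", safe_node_pairs())
    print("worst-case fatal node failures:", min_fatal_nodes())
    print("guaranteed survivable disk failures:", guaranteed_disk_failures())
```

Under these assumptions, only two of the six possible node pairs are safe, and four badly placed disk failures (two in each of two neighbouring nodes) are enough to lose data, which is where the figure of 3 comes from.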
I'm not even going to try and work out what "up to 6 disk failures in the remaining 2 arrays" really means, although I'll bet the words "random" and "quorum" don't feature heavily.
Of course, this all assumes the HP LeftHand diagram above is correct, and not just a graphic designer's interpretation of the facts. The diagram below for the terrible usable capacity of a LeftHand system for 1 node failure protection (split brained or otherwise) is correct. I've had it double checked by HP.
LeftHand nRAID2 means 35% usable capacity

Alex - Chad @ EMC here - what's interesting to me is that we're comparing apples to oranges here. MultiStore isn't really analogous, though MetroCluster is in the same sphere - they are very different from LeftHand's implementation.
While I don't claim to be an expert in NetApp or LeftHand (though I have and use both products just so I can know what I'm talking about) - theirs is functionally a distributed volume manager across a single iSCSI target - it's very different than a MetroCluster, which is a "FAS cluster with distance between the two heads", with a SyncMirror. Failover behavior is essentially the same as if they were sitting right beside one another - BUT ONLY IF the remote head can still access the original aggregate, and not the SyncMirror.
Even in that case, it's still different.
Each has pros and cons (for example, the iSCSI method LHN uses is very difficult for NAS and FC/FCoE use cases as it uses iSCSI redirection techniques). And yes, there is an impact to usable capacity (and really you do need three nodes to make it work with LHN in the stretched way) - but frankly the whole usable capacity thing is a bit orthogonal - when you're talking about this, there's WAY, WAY more to the discussion.
What always drives me NUTS is that no-one talks about the non-storage topics on this (in VMware land - there are a TON of VM HA and DRS considerations never discussed). I did a post on this here: http://virtualgeek.typepad.com/virtual_geek/2008/06/the-case-for-an.html
Personally, I think before we figure out those higher level questions (which vary from use case to use case - are little in the NAS scenarios, but complex in the VMware and other scenarios), it's "bad storage vendor games" (make feature X that I have a requirement so I can compete).
With everyone (everyone!) this use case even has storage downsides (in LHNs case, rebuilds across distance, in NetApp's case the impact to local HA, cabling/zoning complexity, and the SyncMirror bit - that always gets glossed over - and EMC's solutions also have caveats).
Just a bit of separate input on this... For what it's worth.
Posted by: Chad Sakac | July 08, 2009 at 01:50 PM
Chad
Thanks for dropping by.
You make some good points here, but just so as I understand, could you expand on
I didn't quite get the gist of that; do you mean that feature X may not be a requirement, but that we (either LeftHand or NetApp) are erroneously positioning feature X as an absolute requirement?
You're spot on about the non-storage (or application) aspects of HA, which as you point out is another whole topic. I've taken the liberty of HTMLifying your blog link so it can be followed more easily. And at this point, I'd like to assert (if it wasn't already obvious) that I'm no HA storage expert. Most of this was refuting John Spiers' (LeftHand) "apples to oranges" comparisons, and John Martin of NetApp is much more experienced in this area, so I'm going to defer some of the finer points you raise to him and others.
I'd disagree about capacity utilization being orthogonal; it might be to a discussion about HA, but it's certainly not when you're faced with a system that returns 1/3rd of the advertised capacity, HA or otherwise. LeftHand might want to move the discussion elsewhere (and why not?) but I'm focussed on the stuff that brings our customers storage efficiencies. And, given that LeftHand isn't an enterprise class play, efficiency and cost are the pain points for small to medium sized customers.
DSM for MPIO (the LeftHand server-side software that provides iSCSI redirection) is a limitation, as you point out, and it will make it difficult for both LeftHand and its customers to take advantage of technologies like FCoE. That's probably academic to the target customer for this system. Right now, a much bigger limitation is the lack of support for NAS (CIFS in particular) without additional hardware & software. That's a big must-have for customers in this part of the market looking to maximise the return on their money.
Not to mention deduplication. Without it, that 1/3rd usable looks even worse.
Posted by: Alex McDonald | July 09, 2009 at 03:25 AM
I totally get what Chad is saying and it's the same point I made in my blog questioning the value of this debate: http://bit.ly/vY6U3. It's kind of like customers wanting to compare the grocery store prices and you want to debate paper or plastic bags. At the heart of it, capacity utilization is about cost (and by the way, your calculator is still wrong); if you want to hit the mark, let's talk about the overall cost of the solutions. But I'm guessing you don't want your potential customers to experience sticker shock when they learn the cost of your software licensing, compared with HP LeftHand SANs, where the software costs ZERO.
Yet cost is still only one element of the tradeoffs customers make. Your focus on capacity utilization isn't helping customers better decide between HP and NetApp but I'm not sure that's your ultimate goal.
I also have a new post summarizing the debate you started around capacity utilization - a score card was needed to follow this one so here's a link: http://bit.ly/aN0Ve.
Posted by: Calvin Zito | July 09, 2009 at 02:58 PM
Calvin
Please, everyone knows that the cost of a LeftHand SAN includes the software. Otherwise the price of a LeftHand SAN would be the same as the server on which it's based, and you'd have SAN/iQ out there for free download. Don't think I or your customers are dopes.
If the calculator is still wrong, then you have a chance to correct it. Please do so, because I'm gathering information about requirements, and comparative solutions and costs. A colleague is collecting data from customers about real world use. This takes more than just a few days, but I'm sure that the issue of cost won't go away in the meanwhile.
As for tradeoffs; which ones? If the focus on capacity utilization isn't helping customers better decide between HP and NetApp, suggest what does. Performance? Reliability? TCO?
Posted by: Alex McDonald | July 09, 2009 at 03:23 PM
@ Calvin -
While your comment that "capacity utilization is about cost" has elements of truth, I think you're mixing up "price" with "cost".
It's possible that HP's "price" for a certain amount of usable storage may be lower than the price of an equivalent amount of usable storage from NetApp if they choose to discount deeply enough. Discounting is hardly an exercise in business skill or technical merit.
Regardless, putting lots of spindles into a datacenter has many "costs", including power, space, cooling etc. Given that most datacenters are already bulging at the seams, it should be noted that building a new datacenter, or putting in bigger air conditioners because of poor storage efficiency, is hardly "green", regardless of whether you're talking about greenbacks or CO2.
Remember, the genesis of this entire debate was HP's claim that LeftHand was 'green' storage. Somehow I don't think HP really gets what that means.
Posted by: John Martin | July 20, 2009 at 08:47 PM
A LeftHand cluster solution is provided with Virtual Manager & Failover Manager to provide majority/quorum decision in case of node failure/split
Posted by: test | August 25, 2009 at 08:23 AM
@test
Yes, but it requires yet another physical server, preferably located in a third location that has connectivity to the primary and DR sites.
Posted by: Alex McDonald | August 25, 2009 at 08:29 AM