
June 26, 2009

Comments

Let me make myself clear: any unrecoverable lost write error is one too many.

What I would like to know: what are the chances of getting such an error? What is the "Mean Time Between Lost Write Errors"? What are the chances I'll have to fall back on tape or whatever to recover an environment corrupted by lost writes?
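One way to frame that question: if lost writes are assumed to arrive at a constant average rate, the chance of seeing at least one in a given window follows from a simple Poisson model. The sketch below assumes that model only for illustration; the function name is hypothetical and the mean time between errors is an input you would have to take from measured field data, not a figure supplied here.

```python
import math

def prob_at_least_one(mean_time_between_errors_h: float, window_h: float) -> float:
    """Chance of one or more events in window_h hours,
    assuming a constant-rate (Poisson) arrival process."""
    if mean_time_between_errors_h <= 0:
        raise ValueError("mean time between errors must be positive")
    rate_per_hour = 1.0 / mean_time_between_errors_h
    return 1.0 - math.exp(-rate_per_hour * window_h)
```

Under this model, a window equal to the mean time between errors gives a probability of about 63%, not 100%, which is why a "mean time between" figure alone can be misleading.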

Network RAID 2 protects you from multiple disk faults, complete array faults and site faults, with auto failover and failback. NetApp can’t deliver this level of HA with auto failover and failback. Features like MetroCluster give you data protection, but not HA, and at a lower capacity utilization than LeftHand. Can SnapMirror or MetroCluster automatically fail back and incrementally rebuild the primary site while maintaining application state and data integrity, i.e. RPO=0 and 100% uptime? I didn’t think so.

>>“Then you'd be wrong; that's exactly what MetroCluster is about. Except the auto failback; that's just adding a disaster on top of a disaster. But I digress, and that's the subject for another post.”

I’m not letting you brush this one under the rug. MetroCluster and SyncMirror don’t provide the same level of availability that is inherent in LeftHand’s base SAN/iQ software offering, and LeftHand requires no additional equipment.

I found the following quote in NetApp’s Data ONTAP Active/Active Configuration guide:

“Mirrored active/active configurations do not provide the capability to fail over to the partner node if one node is completely lost. For example, if power is lost to one entire node, including its storage, you cannot fail over to the other node. For this capability, use a MetroCluster”

MetroCluster is nothing more than a standard NetApp cluster that has been stretched or separated. Once it’s deployed you lose local high availability and any major fault will result in a site failover.
The standard MetroCluster solution requires manual intervention for both failover and failback. So the claim of an automated and transparent failover solution is false.
Each Filer head must have a licensed copy of the following software: SyncMirror, Cluster license, Remote Cluster license and MetroCluster License. How is that simple to deploy and manage?

LeftHand replicates over standard IP networks. LeftHand has customers synchronously replicating over 100km over standard Ethernet networks with latencies of <3ms.

NetApp MetroCluster is limited to 500m over IP. For distances over 500m it requires four Fibre Channel switches in a dual-fabric configuration, Fibre to IP bridge equipment, and a separate cluster interconnect card.

For example, a Fabric MetroCluster requires:
An active-active pair of FAS900, 3000, 3100 (two single controller chassis), or 6000 series controllers running Data ONTAP 6.4.1 or later
1. Four Brocade Fibre Channel switches with supported firmware supplied by NetApp - a pair at each location.
2. Brocade Extended Distance license (if over 10km)
3. Brocade Full-Fabric license
4. Brocade Ports-on-Demand (POD) licenses for additional ports
5. A VI-MC cluster adapter
6. A syncmirror local license
7. A cluster remote license
8. A cluster license
9. Associated cabling

MetroCluster requires a full-copy resync for SyncMirror rejoin and is at risk if a second failure event occurs before the resync is complete. LeftHand performs an automated failback with a changed-data resync of the primary site.

Auto-failover, you say? Let’s look at page 15 of the MetroCluster Design and Implementation Guide:

“Upon determining that one of the sites has failed, the administrator must execute a specific command on the surviving node to initiate a site takeover. The command is: cf forcetakeover -d. The takeover is not automatic because there may be cases in an active-active configuration in which the network between sites is down and each site is still fully functional. In this case a forced takeover might not be desirable.”

“The cf forcetakeover command previously described allows the surviving site to take over the failed site’s responsibilities without a quorum of disks available at the failed site (normally required). Once the problem at the failed site is resolved, the administrator must follow certain procedures, including restricting booting of the previously failed node. If access is not restricted, a split brain scenario may occur. This is the result of the controller at the failed site coming back up not knowing that there is a takeover situation. It begins servicing data requests while the remote site also continues to serve requests. The result is the possibility of data corruption.”

LeftHand has distributed quorum management that eliminates any possibility of a “split brain”. This allows at least one site to keep operating and then automatically resync the other sites when they come back online. In a LeftHand SAN a volume instance spans both sites; it is not a copy of a volume. This means the applications see the same volume serial number and metadata that was on the primary site, because it’s the same volume. This makes application failover and failback seamless.
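The comment does not say how the quorum is computed. As a rough illustration of why quorum rules prevent split brain, here is a minimal sketch that assumes a plain strict-majority vote among a fixed set of voting managers; the function names are hypothetical and this is not a description of SAN/iQ internals.

```python
def has_quorum(reachable_managers: int, total_managers: int) -> bool:
    """A partition may serve I/O only if it sees a strict majority of managers."""
    return 2 * reachable_managers > total_managers

def partitions_allowed_to_serve(partition_sizes):
    """Given how many managers each network partition can see,
    return the sizes of the partitions that keep quorum."""
    total = sum(partition_sizes)
    return [size for size in partition_sizes if has_quorum(size, total)]

# Two sites plus a tie-breaker (3 managers): the side with 2 keeps running.
print(partitions_allowed_to_serve([2, 1]))
# Even split of 4 managers: neither side has a majority, so both stop.
print(partitions_allowed_to_serve([2, 2]))
```

Because two disjoint partitions can never both hold a strict majority, at most one side keeps serving data. The cost of that rule is that an even split stops both sides, which is why an odd number of voters is the usual design choice.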
Since NetApp uses a copy of a volume instead of the same volume, they document the following:

“iSCSI and Fibre Channel LUNS may need to be rescanned by the application if the application (i.e., VMware®) relies on the LUN serial number. When a new FSID is assigned, the LUN serial number changes.”
I won’t get into automatic failback, because you already admitted NetApp can’t do that, plus it’s a mess and would take another page on this blog to describe it.

>>It's worth pointing out before I analyse this claim that I originally thought that LeftHand had come up with a new paradigm with its network RAID; that it provided both data protection and high availability built on commodity tin. I was wrong; if it was that easy, we'd have done it. But NetApp's 15 years experience in doing this stuff has taught us otherwise.

It’s this big company ego that allows storage startups to do what they do best, think outside the box.

I can’t get into details about how LeftHand’s software works, but I can say the following:
1. HP RAID controllers are among the most technologically advanced controllers in the market, and they’ve been developing RAID controllers for longer than NetApp’s been in existence. The RAID controllers contain battery backed cache and do not release the cache until the data is on disk. We are not exposed to “lost writes.”
2. We do background scrubbing and recreate or reallocate bad disk sectors off of parity or Network RAID.
3. We can have any number of controllers reading and writing to any number of volumes. For example, a 4 node system has 4 controllers performing I/O, and with Network RAID up to 2 controllers can fail without the SAN going down. A 6 node system can support 3 controller failures, etc.
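To make the "2 of 4, 3 of 6" figures concrete, here is a small sketch under a stated assumption: each block has two copies placed on adjacent nodes in a ring. That layout is an assumption for illustration, not something stated in the comment, and quorum requirements are ignored. Under it, data stays available only while no two adjacent nodes are down together.

```python
from itertools import combinations

def survives(n_nodes: int, failed, copies: int = 2) -> bool:
    """True if every group of `copies` adjacent nodes still has a survivor."""
    failed = set(failed)
    return all(
        any((start + k) % n_nodes not in failed for k in range(copies))
        for start in range(n_nodes)
    )

def max_survivable_failures(n_nodes: int, copies: int = 2) -> int:
    """Largest number of failed nodes for which SOME failure pattern is survivable."""
    best = 0
    for r in range(1, n_nodes + 1):
        if any(survives(n_nodes, c, copies) for c in combinations(range(n_nodes), r)):
            best = r
    return best

print(max_survivable_failures(4))   # 2
print(max_survivable_failures(6))   # 3
print(survives(4, {0, 2}))          # True: alternating nodes
print(survives(4, {0, 1}))          # False: adjacent nodes hold both copies
```

Under this assumed layout the quoted numbers are a best case: they hold when the failed nodes alternate, not for an arbitrary pair or triple.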
>>LeftHand Data Protection and HA
>>LeftHand systems are built from commodity servers and use a battery backed RAID controller to provide a log of writes to disk. This means that when a single node (non-HA) fails, the data in the cache is replayed to the disks when the system comes back up.

This is a completely false statement.

>>But what about lost writes? Do LeftHand SANs provide protection against failure to write data correctly from the disk's cache?

Yes

>>In a cluster, and using nRAID2, the IO is copied to a second node, and the same scenario as in the single node case is played out. Effectively, cache coherency is provided by mirroring the data to a second (or third, or fourth) node across the network, which is slower and adds to latency.

There is a slight write penalty, but reads are much faster, which means there is very little performance difference when running Network RAID with a typical 60/40 read/write workload.

>>With nRAID2, when a node fails, you want to guarantee your writes because the other node(s) are now holding a single copy of your data. That nRAID2 is now no more than RAID5 striped across one or more nodes, writing data through disk cache.

True. Just like NetApp has a single point of failure after a controller fails.


>>And every read is still fraught with danger. You may read a block from node A but get completely different data from node B for the same block request -- because there's no guarantee of protection against lost writes on any of the nodes.

This is obviously not true.

>>At which point, you now have a RAID5 solution with no lost write protection (or you turn on write-through on every disk and suffer huge IO penalties). Really, there's no point. Might as well buy a cheap Linux server and be done with it.

Like I said before, we are not exposed to lost writes.

Look under the hood of a NetApp controller head and you will see the same x86 chipset that is in a LeftHand node.

BTW, we don’t have performance degradation over time like NetApp because we always allocate blocks in sequential “strides”, and re-layout data in the background if required to keep I/O as sequential as possible. There is little to no performance degradation as the LeftHand SAN fills up.

>>No, let's focus first on my claim that LHN means more space, more power, more cooling, more cost per usable TB for an inferior solution. Address that, and then you get bragging rights.

Ok:

If you take the power used in a NetApp controller head and amortize it over the disk shelves, and add it to the power consumption of the disk shelves, NetApp uses more power than LeftHand. LeftHand also uses SAS drives instead of Fibre Channel drives, which consume less power. Other energy savings features incorporated into server architectures help reduce power even further.

LeftHand uses less space because we don’t have the controller head, and our 2U 12 drive disk shelf is 6 drives per U, while NetApp’s 14 drive shelf is only 4.67 drives per U.
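The density comparison is simple division. In the sketch below the 3U height of the 14 drive shelf is an assumption inferred from the quoted ratio (the comment does not state it), and the function name is hypothetical.

```python
def drives_per_rack_unit(drives: int, rack_units: int) -> float:
    """Drive density of one shelf, in drives per rack unit (U)."""
    return drives / rack_units

lefthand = drives_per_rack_unit(12, 2)  # 2U, 12 drives, as stated in the comment
netapp = drives_per_rack_unit(14, 3)    # 14 drives; 3U height is assumed

print(lefthand)            # 6.0
print(round(netapp, 2))    # 4.67
```

Note that this counts shelf density only; it says nothing about how many of those drives hold usable data once replication or parity is accounted for.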

Our storage utilization with best practice configurations is comparable to NetApp’s in a non-HA configuration and better in HA configurations. When using Parity Based Network RAID it is better than NetApp in most configurations.

Looks like I have bragging rights.

John,

1. The RAID controllers contain battery backed cache and do not release the cache until the data is on disk. We are not exposed to “lost writes.”

2. We do background scrubbing and recreate or reallocate bad disk sectors off of parity or Network RAID

I'm disappointed, as I would have expected that an engineer with your apparent level of storage experience would understand that lost writes occur even when the disk has acknowledged the write back to the RAID controller. This can be summarised as "disks occasionally lie". It doesn't matter whether your HP RAID controller keeps the write until it's acknowledged or not; if the disk sends back inaccurate information, you're screwed.

2. A traditional RAID scrub won't pick up this kind of subtle corruption without some kind of semantic knowledge of what data is "meant" to be there. As far as I can tell, you don't have that capability.

3. Even if your scrub does pick this up, what happens if you serve up a block that is subject to a lost write before the RAID scrub gets around to fixing it?

I submit that, despite your protestations, you are indeed subject to data corruption via "lost writes".
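The argument above can be shown with a toy model. The sketch below is a generic illustration, not NetApp's or LeftHand's implementation: a simulated disk acknowledges a write but silently keeps the old block. A checksum stored with the block still verifies, because the stale block is internally consistent; only an expected generation number kept somewhere else reveals the problem. All class and function names are hypothetical.

```python
import zlib

class LyingDisk:
    """Toy disk that acknowledges every write but can silently drop one."""
    def __init__(self):
        self.blocks = {}        # block number -> (payload, checksum, generation)
        self.drop_next = False

    def write(self, lba: int, payload: bytes, generation: int) -> bool:
        if self.drop_next:
            self.drop_next = False
            return True         # acknowledged, yet the old contents remain
        self.blocks[lba] = (payload, zlib.crc32(payload), generation)
        return True

    def read(self, lba: int):
        return self.blocks[lba]

def checksum_ok(record) -> bool:
    """Per-block check: does the payload match the checksum stored beside it?"""
    payload, checksum, _ = record
    return zlib.crc32(payload) == checksum

def generation_ok(record, expected_generation: int) -> bool:
    """Identity check: is this the version the separate metadata expects?"""
    return record[2] == expected_generation

disk = LyingDisk()
expected = {}                   # metadata held apart from the data block

disk.write(7, b"old contents", 1)
expected[7] = 1

disk.drop_next = True           # the next write is acknowledged but lost
disk.write(7, b"new contents", 2)
expected[7] = 2

stale = disk.read(7)
print(checksum_ok(stale))                  # True: stale data looks healthy
print(generation_ok(stale, expected[7]))   # False: the lost write is exposed
```

A scrub that only re-verifies per-block checksums or parity behaves like `checksum_ok` here, which is the "semantic knowledge" gap point 2 refers to.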


Secondly, while I think you've done a good job with the quorums etc., I still think that adding a completely new set of spindles for a basic level of fault resiliency expected by every customer in the industry is wasteful. Even if you give the stuff away, it certainly isn't "green".

Mathieu

This is the paper you want to read. (You'll need GhostScript to read this PS file.) Robin Harris summarised some of the findings in his blog that I pointed to.

Also note the difference between media and data scrubbing. NetApp does data scrubbing; it looks like LeftHand does only once-a-month media scrubbing, something that's quite inferior in its ability to detect latent errors. NetApp goes to exceptional lengths to protect your data.


John Spiers;

If I could find any best practices documents for LeftHand, I might be tempted to cut and paste too. Point me at some, I can't find them.

I'm afraid your whole reply demonstrates confusion between high availability (which you have) and adequate data protection (which you don't have).

John Martin's disappointment is mine too; I expected better from an engineer with your background on the subject of lost writes. FC drives have checksums at the hardware level for that very reason -- data reliability -- and NetApp ensures that non-FC drives like SATA get the same rigorous checksum protection in software against lost writes. You seem to think they can just be ignored, but your assertion that you don't have the problem is very far from the truth.

And nRAID still makes a nonsense of your colleague Chris McCall's claim:

“so we can achieve very high rates of capacity utilization and reduce the overall cost of the storage in your environment.”

What Chris meant was more space, more power, more cooling, more cost per usable TB for an inferior solution. You do the sums at 34% utilization, and you'll see what I mean.
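"Doing the sums" amounts to dividing cost by usable rather than raw capacity. A minimal sketch follows, with a hypothetical function name; the cost and raw capacity arguments are placeholders, and only the 34% figure comes from the comment above.

```python
def cost_per_usable_tb(total_cost: float, raw_tb: float, utilization: float) -> float:
    """Cost per usable TB when only `utilization` (0..1) of raw capacity holds data."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return total_cost / (raw_tb * utilization)

# At 34% utilization every usable TB needs about 2.94 raw TB behind it,
# so cost, space, power and cooling per usable TB scale by the same factor.
raw_tb_per_usable_tb = 1 / 0.34
print(round(raw_tb_per_usable_tb, 2))   # 2.94
```

The same multiplier applies to any per-drive quantity, which is the basis of the "more space, more power, more cooling" claim; whether 34% is the right utilization figure is exactly what the two sides dispute.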

I posted this on a HP blog last week, but thought it also fits here.

NetApp talks about having 99.999% uptime and providing HA. I came across an Avanade PDF on NetApp's website (media.netapp.com/.../Avanade_Testing_Center_NetApp_Whitepaper_Exchange.pdf) that discusses testing the HA functionality of the FAS3050c array. I was quite shocked to find out on page 6 that both hard and graceful failovers caused 2 minutes and 27 seconds of user downtime. Maybe I'm wrong, but I don't see how that downtime is acceptable unless you are using it strictly for NAS. I would hate to see what would happen to a database or VMware if that were to take place while they were running.

I am currently an HP LeftHand Networks customer and have been very satisfied with their product. I have unintentionally as well as intentionally caused HA (using nRAID2) to kick in by failing LH node(s) and have never had a problem. I have never seen any user downtime because of it, and we are running Exchange, SQL, File, VMware, Oracle, etc. It's been nice to be able to do upgrades in the middle of the day without having to worry about downtime. Unless NetApp has made some changes to improve the Filer failover time, there is really no comparison between HP LeftHand and NetApp. From personal experience, when it comes to High Availability, HP LeftHand provides a far superior product. Granted, I do not have a NetApp system to gain personal experience with, so I have to take the referenced PDF for what it is worth, especially since it is coming from NetApp's website.

Thanks,
Chris

@Chris

Thanks for commenting.

With appropriate timeout settings on applications, there's no downtime, even though the failover may take time to complete.

There have been many updates to NetApp systems since that paper was published. In general, for failover situations the rule of thumb is, for Data ONTAP 7.2 – 120s; and for 7.3 and beyond – 60s.

Notice that these are worst case timings. On average the timings will be much better. And our customers don't have to take downtime to upgrade; just a controlled failover and giveback.
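The timeout argument reduces to one comparison: an outstanding I/O is retried rather than failed as long as storage comes back before the application or initiator gives up. In this sketch the failover durations are the rule-of-thumb figures quoted above, while the timeout values are placeholders chosen for illustration, not recommended settings.

```python
def io_survives_failover(failover_s: float, io_timeout_s: float) -> bool:
    """True if storage returns before the host-side I/O timeout expires."""
    return failover_s < io_timeout_s

WORST_CASE_FAILOVER_S = {"Data ONTAP 7.2": 120, "Data ONTAP 7.3+": 60}

for release, failover_s in WORST_CASE_FAILOVER_S.items():
    for timeout_s in (30, 180):   # placeholder host timeouts, in seconds
        ok = io_survives_failover(failover_s, timeout_s)
        print(f"{release}: failover {failover_s}s, timeout {timeout_s}s -> "
              f"{'I/O resumes' if ok else 'I/O error surfaces'}")
```

The design question is therefore less about the failover time itself than about whether every layer in the stack has a timeout longer than the worst case.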

Secondly, the zero RTO/RPO failover scenarios you have tried with LHN equipment are achievable with NetApp systems; see our MetroCluster for true HA and DR robustness.

LHN don't have any certification or user-measurable uptime statistics. Ours are drawn from our AutoSupport system that reports, in detail, information about the performance and availability of NetApp systems, from which we collect uptime statistics. IDC have certified NetApp's methodology for uptime measurement; measured uptime is currently in excess of 99.999% for a single (non-HA) system, and near six 9s for HA systems.

That's the difference I see every day. We can back up our claims; can LHN?
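For scale, an availability percentage converts directly into a yearly downtime budget. The sketch below is plain arithmetic using a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes in a 365-day year

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Yearly downtime allowed by a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

five_nines = downtime_minutes_per_year(99.999)    # about 5.26 minutes
six_nines = downtime_minutes_per_year(99.9999)    # about 0.53 minutes (~32 s)

print(round(five_nines, 2), round(six_nines, 2))
```

Against a five 9s budget of roughly 5.26 minutes a year, the 2 minute 27 second failover cited earlier in the thread would use a bit under half, if it is counted as downtime at all; whether it should be counted is the point the commenters disagree on.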

@Chris

Just one last observation; are you using VMware's vMotion (you indicate you're a VMware user)?

If you are, then motion times of minutes are usual and nobody considers that downtime, since the applications are still up. Applications that can't tolerate these kinds of failover scenarios aren't much use in a virtualised environment either. Most can, especially those you mention: Exchange, SQLserver, Oracle etc.

@Alex,

Thanks for your replies and clarification. It's good to know that failover time is now in the 60s. That sounds a bit more reasonable. Is there a difference between controlled and uncontrolled failovers?

I am glad to see that you compare LHN zero RTO/RPO failover with NetApp using MetroCluster. When comparing HA/DR functionality between the two, I would have to say that NetApp would have to run MetroCluster in order to really compare the two. Even though LHN (with at least nRAID2) and NetApp with MetroCluster have similar HA functionality, there are still a couple of key differences between the two that I can think of off the top of my head.

Please correct me if I am wrong with any of these.

1. NetApp does not load balance between the filer and storage in each site (up to 2 sites). One location is the active and primary location. LHN load balances between all controllers/storage in all sites (up to 4 sites).
2. When NetApp fails over, VMware has to do a rescan because the volume signature changes. Wouldn't this cause downtime? This does not happen with LHN.
3. With LHN, you can choose which volumes to enable HA for. This can be done on the fly: turn it on or off, with your choice of no replication, nRAID(2), nRAID(3), or nRAID(4).
4. In the event of a site failure/recovery, LHN automatically syncs all changes back to the site without any user intervention, which restores the HA protection level, and the controllers/storage become instantly available. NetApp requires administrator intervention to gracefully fail it back.
5. Last but not least, the MetroCluster solution from NetApp costs about twice as much, if not more. I know this because we recently did a comparison between the two and chose LHN over NetApp.

In regards to your comment about VMware, I assume you meant VMotion instead of motion? I am not sure what you mean when you say it takes minutes to VMotion a virtual machine. I have not seen that; I've seen it take up to 10 sec of paused time/downtime to VMotion. Are the minutes you mentioned due to the storage underneath? Also, when talking about the apps in VMware such as Exchange, SQLServer and Oracle, do you normally use vmdks for their data/log files? I use iSCSI within the VMs and I know that if I disconnect the network for long enough (30 sec.) it can cause the database to crash. So I am still skeptical of a database application staying up during 60s of storage failover/downtime.

Those are my thoughts. Take them for what they are worth and let me know what you think.

© NetApp, Inc.  |  "Safe Harbor" Statement  |  Privacy Policy