Part 1 of my investigation of LeftHand's claim to save money, given its less than stellar 35% usable from raw disk space, generated a number of interesting replies from HP. One is worthy of more analysis, but as it's a bit big for a comment, I've taken the liberty of extracting John Spiers's reply to me. John was CTO of LeftHand Networks prior to its acquisition by HP.
There's little publicly available information on how LeftHand does its stuff, so I've had to do a little "reading between the lines" and work from first principles. If I've got any of this wrong, please let me know, and I'll correct it. I've added a running commentary to John's comment; my apologies for breaking it up, but it's all here (in blue for clarity).
When using Network RAID 2 it protects you from multiple disk faults, complete array faults and site faults with auto failover and failback. NetApp can’t deliver this level of HA with auto failover and failback. Features like MetroCluster give you data protection, but not HA, and at a lower capacity utilization than LeftHand. Can SnapMirror or MetroCluster automatically fail back, incrementally rebuild the primary site, while maintaining application state and data integrity – i.e. RPO=0 and 100% uptime? I didn’t think so.
Then you'd be wrong; that's exactly what MetroCluster is about. Except the auto failback; that's just adding a disaster on top of a disaster. But I digress, and that's the subject for another post.
It's worth pointing out before I analyse this claim that I originally thought that LeftHand had come up with a new paradigm with its network RAID; that it provided both data protection and high availability built on commodity tin. I was wrong; if it was that easy, we'd have done it. But NetApp's 15 years' experience in doing this stuff has taught us otherwise.
First, I need to explain what a NetApp cluster is all about; then we can compare and contrast, and ask some questions.
NetApp Data Protection and HA
NetApp uses NVRAM (non volatile RAM) and transaction logging to capture writes to disk. All writes are acknowledged before they get to disk, but only after they've been logged. This means that when a single controller (non-HA) fails, the data we said to the server or client that we wrote, but stored in NVRAM, is replayed to the disks when the system comes back up. That way, we guarantee what we promised when we said we'd written the data. The data is both consistent and durable.
In a cluster or HA solution, we ensure cache coherency; the contents of the NVRAM on controller 1 are mirrored to controller 2. That way, when controller 1 goes down, controller 2 can replay controller 1's writes, and take over its workload, as the disks are addressable from both controllers. Again, the data is both consistent and durable; and we've made it HA, without downtime to the application. It carries on running.
If controller 2 now fails (or both fail together), we're still consistent and durable; see above for the single controller case.
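To make the log-then-acknowledge idea concrete, here is a minimal Python sketch. It is a toy model, not Data ONTAP code: the `Controller` class, the `replay` helper and the dictionary standing in for the disks are all assumptions made for illustration. It shows only the ordering that matters: a write is logged (and, in the HA case, mirrored to the partner) before it is acknowledged, so either a reboot or a partner takeover can replay it.

```python
from dataclasses import dataclass, field


def replay(log, disks):
    """Apply logged writes to disk in order, then clear the log."""
    for block, data in log:
        disks[block] = data
    log.clear()


@dataclass
class Controller:
    """Hypothetical controller; the lists stand in for battery-backed NVRAM."""
    name: str
    nvram: list = field(default_factory=list)          # this controller's logged writes
    partner_nvram: list = field(default_factory=list)  # mirror of the partner's log

    def write(self, block, data, partner=None):
        self.nvram.append((block, data))                 # log first
        if partner is not None:
            partner.partner_nvram.append((block, data))  # mirror before acknowledging
        return "ack"                                     # the disks are not touched yet

    def reboot(self, disks):
        replay(self.nvram, disks)           # single-controller recovery

    def take_over(self, disks):
        replay(self.partner_nvram, disks)   # HA recovery by the surviving partner


disks = {}  # block number -> data; stands in for shelves both controllers can address
c1, c2 = Controller("c1"), Controller("c2")

# Single-controller case: c1 acknowledges, fails, and replays its own log on reboot.
ack = c1.write(1, b"alpha")
c1.reboot(disks)

# HA case: c1 acknowledges a mirrored write and fails; c2 replays it and carries on.
c1.write(2, b"beta", partner=c2)
c2.take_over(disks)
```

In both cases the acknowledged data ends up on disk, which is the promise described above.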
Lastly, SATA drives in particular can suffer from "lost writes". Every drive has a cache where it stores data to be written. This is separate from any other protected cache, for instance NetApp's NVRAM.
As soon as an IO hits this buffer, the drive acknowledges the write. But blocks can subsequently be written in the wrong place, or not written at all, especially if there's a disk failure between acknowledgement and the physical write.
Because NetApp has the ability to control both the RAID and the file system, Data ONTAP 7G provides the unique ability to catch errors such as this and recover. Along with a block checksum, ONTAP also stores WAFL metadata (the inode # of a file containing the block) that provides the ability to verify the validity of a block being read. If the block being read does not match what WAFL expects, the data gets reconstructed, ensuring that your data is both consistent and durable.
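Why a checksum alone is not enough can be shown with a small Python sketch. This is an illustration under stated assumptions, not ONTAP's on-disk format: the `seal`, `is_valid` and `read_block` helpers, the CRC32 checksum and the single XOR parity block are all hypothetical stand-ins. A lost write leaves a stale block on disk that is perfectly intact, so its checksum passes; only the identity check (does this block belong to the file and offset I asked for?) exposes it, after which the data is rebuilt from parity.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together (single-parity style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def seal(data, inode, offset):
    """Store a block with its checksum and the identity the file system expects."""
    return {"data": data, "crc": zlib.crc32(data), "inode": inode, "offset": offset}


def is_valid(stored, inode, offset):
    """Valid means intact AND the block that was actually asked for."""
    intact = zlib.crc32(stored["data"]) == stored["crc"]
    expected = stored["inode"] == inode and stored["offset"] == offset
    return intact and expected


def read_block(stripe, idx, parity, inode, offset):
    """Return the block if it verifies; otherwise rebuild it from the rest of the stripe."""
    stored = stripe[idx]
    if is_valid(stored, inode, offset):
        return stored["data"]
    others = [b["data"] for i, b in enumerate(stripe) if i != idx]
    return xor_blocks(others + [parity])


# Three data blocks of (hypothetical) inode 42, with parity computed over the new data.
new_data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(new_data)
stripe = [seal(d, inode=42, offset=i) for i, d in enumerate(new_data)]

# Lost write: the drive acknowledged b"BBBB" but never wrote it, so a stale,
# internally consistent block from another file is still sitting there.
stripe[1] = seal(b"old!", inode=12, offset=9)

checksum_alone_passes = zlib.crc32(stripe[1]["data"]) == stripe[1]["crc"]
recovered = read_block(stripe, 1, parity, inode=42, offset=1)
```

The sketch does not cover the harder case where the stale block belonged to the same file and offset; catching that needs more than an inode number.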
NetApp goes to extraordinary lengths to protect your data.
Here's the issue. I can't see this level of protection in a LeftHand SAN.
LeftHand Data Protection and HA
LeftHand systems are built from commodity servers and use a battery backed RAID controller to provide a log of writes to disk. This means that when a single node (non-HA) fails, the data in the cache is replayed to the disks when the system comes back up.
But what about lost writes? Do LeftHand SANs provide protection against failure to write data correctly from the disk's cache?
In a cluster, and using nRAID2, the IO is copied to a second node, and the same scenario as in the single node case is played out. Effectively, cache coherency is provided by mirroring the data to a second (or third, or fourth) node across the network, which is slower and adds to latency.
But what about lost writes? Do LeftHand SANs provide protection against failure to write data correctly from the disk's cache?
With nRAID2, when a node fails, you want to guarantee your writes because the other node(s) are now holding a single copy of your data. That nRAID2 is now no more than RAID5 striped across one or more nodes, writing data through disk cache.
But what about lost writes? Do LeftHand SANs provide protection against failure to write data correctly from the disk's cache?
The choices to protect your data?
- Turn on write-through (turn off the disk cache and force IO straight to disk). On all disks on all nodes.
- Choose nRAID3 so you have a second mirror.
The first option causes huge IO performance problems; drives that are forced to write directly perform very badly indeed.
The second option of nRAID3 is the only alternative.
And every read is still fraught with danger. You may read a block from node A but get completely different data from node B for the same block request -- because there's no guarantee of protection against lost writes on any of the nodes.
The LeftHand Triplication Calculator
Ok, let’s talk capacity. All NetApp’s customers know NetApp’s storage utilization is below 50% when using best practices.
NetApp's best practices are here. See page 20, section 7.4 Best Practice Configurations. This is the same old HP tap dancing.
But instead of re-hashing what everyone already knows, let’s do a simple calculation for a highly available multi-site SAN using MetroCluster (as a side note, you know Calvin is taking it easy on you with the MetroCluster pricing.) Let’s say a customer has 10TB of NetApp raw storage at the primary site and they replicate that 10TB to a remote site for HA and disaster protection. Storage utilization is now 50% (10TB/20TB.) Take your 63% at both sites, and we won’t bother to include things like the space taken up for NetApp’s root volume and replication log files. 63% of 10TB leaves you 6.3TB of usable capacity replicated. This means you can create 6.3TB of data out of 20TB raw. That’s 31.5%. With LeftHand’s Network RAID level 2 you can split your SAN across 2 sites for a better HA solution and the customer’s utilization, according to you, is better than NetApp’s – 35%.
Except it isn't anywhere near MetroCluster in terms of HA and data protection -- in fact it's nowhere near a single controller NetApp system in terms of data protection.
LeftHand is now down to 24% usable for an inferior solution.
24% usable
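For anyone who wants to check the sums, here is a short Python sketch of the arithmetic. The first function reproduces the MetroCluster calculation quoted above (63% usable per site, 10TB raw at each of two sites). The second is a simplifying assumption of mine, not the calculator's method: it supposes usable capacity scales inversely with the number of network RAID copies, so moving from two copies to three multiplies the usable fraction by 2/3. Both function names are made up for this example.

```python
def usable_fraction(raw_tb_per_site, sites, efficiency):
    """Usable data as a fraction of all raw disk when each site holds a full copy."""
    usable_tb = raw_tb_per_site * efficiency
    return usable_tb / (raw_tb_per_site * sites)


def rescale_copies(fraction, copies_now, copies_new):
    """Assumption: usable fraction is inversely proportional to the copy count."""
    return fraction * copies_now / copies_new


# The quoted MetroCluster example: 6.3TB usable out of 20TB raw, about 31.5%.
metrocluster = usable_fraction(raw_tb_per_site=10, sites=2, efficiency=0.63)

# LeftHand at 35% with two copies (nRAID2), rescaled to three copies (nRAID3).
lefthand_three_copies = rescale_copies(0.35, copies_now=2, copies_new=3)

print(f"MetroCluster example: {metrocluster:.1%}")
print(f"LeftHand, three copies (assumed scaling): {lefthand_three_copies:.1%}")
```

Under that assumption the three-copy figure comes out a little over 23%, close to but not exactly the 24% quoted above; the calculator's exact inputs aren't given here, so treat this as a sanity check rather than a reproduction.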
The Rest
Now it’s time for some education:
Network RAID is set at the volume level. Not all volumes require the advance data protection level of Network RAID level 2, therefore utilization is typically much better if used at a single site.
At which point, you now have a RAID5 solution with no lost write protection (or you turn on write-through on every disk and suffer huge IO penalties). Really, there's no point. Might as well buy a cheap Linux server and be done with it.
If you really want to get schooled let’s talk about dual-parity based network RAID and what that does for utilization. What we should really be talking about is cost and capacity utilization of NetApp GX vs. LeftHand, because that is the only product that comes close to LeftHand’s architecture. Does GX support block yet?
No, let's focus first on my claim that LHN means more space, more power, more cooling, more cost per usable TB for an inferior solution. Address that, and then you get bragging rights.

Let me make myself clear: any unrecoverable lost write error is one too many.
What I would like to know: what are the chances of getting such an error? What is the "Mean Time Between Lost Write Errors"? What are the chances I've got to fall back on tape or whatever to recover an environment corrupted by lost writes?
Posted by: Mathieu van Schaik | June 29, 2009 at 12:27 PM
When using Network RAID 2 it protects you from multiple disk faults, complete array faults and site faults with auto failover and failback. NetApp can’t deliver this level of HA with auto failover and failback. Features like MetroCluster give you data protection, but not HA, and at a lower capacity utilization than LeftHand. Can SnapMirror or MetroCluster automatically fail back, incrementally rebuild the primary site, while maintaining application state and data integrity – i.e. RPO=0 and 100% uptime? I didn’t think so.
>>“Then you'd be wrong; that's exactly what MetroCluster is about. Except the auto failback; that's just adding a disaster on top of a disaster. But I digress, and that's the subject for another post.”
I’m not letting you brush this one under the rug. MetroCluster and SyncMirror don’t provide the same level of availability that is inherent in LeftHand’s base SAN/iQ software offering, and LeftHand requires no additional equipment.
I found the following quote in NetApp’s Data ONTAP Active/Active Configuration guide:
“Mirrored active/active configurations do not provide the capability to fail over to the partner node if one node is completely lost. For example, if power is lost to one entire node, including its storage, you cannot fail over to the other node. For this capability, use a MetroCluster”
MetroCluster is nothing more than a standard NetApp cluster that has been stretched or separated. Once it’s deployed you lose local high availability and any major fault will result in a site failover.
The standard MetroCluster solution requires manual intervention for both failover and failback. So the claim of an automated and transparent failover solution is false.
Each Filer head must have a licensed copy of the following software: SyncMirror, Cluster license, Remote Cluster license and MetroCluster License. How is that simple to deploy and manage?
LeftHand replicates over standard IP networks. LeftHand has customers synchronously replicating over 100km over standard Ethernet networks with latencies of <3ms.
NetApp MetroCluster is limited to 500m over IP. For distances over 500m it requires four Fibre Channel switches in a dual-fabric configuration, Fibre to IP bridge equipment, and a separate cluster interconnect card.
For example, a Fabric MetroCluster requires:
An active-active pair of FAS900, 3000, 3100 (two single controller chassis), or 6000 series controllers running Data ONTAP 6.4.1 or later
1. Four Brocade Fibre Channel switches with supported firmware supplied by NetApp - a pair at each location.
2. Brocade Extended Distance license (if over 10km)
3. Brocade Full-Fabric license
4. Brocade Ports-on-Demand (POD) licenses for additional ports
5. A VI-MC cluster adapter
6. A syncmirror local license
7. A cluster remote license
8. A cluster license
9. Associated cabling
MetroCluster requires a full-copy resync for SyncMirror rejoin and is at risk if a second failure event occurs before the resync is complete. LeftHand performs an automated failback with changed-data resync of the primary site.
Auto-failover you say? Let’s look at page 15 of the MetroCluster Design and Implementation Guide:
“Upon determining that one of the sites has failed, the administrator must execute a specific command on the surviving node to initiate a site takeover. The command is: cf forcetakeover -d. The takeover is not automatic because there may be cases in an active-active configuration in which the network between sites is down and each site is still fully functional. In this case a forced takeover might not be desirable.”
“The cf forcetakeover command previously described allows the surviving site to take over the failed site’s responsibilities without a quorum of disks available at the failed site (normally required). Once the problem at the failed site is resolved, the administrator must follow certain procedures, including restricting booting of the previously failed node. If access is not restricted, a split brain scenario may occur. This is the result of the controller at the failed site coming back up not knowing that there is a takeover situation. It begins servicing data requests while the remote site also continues to serve requests. The result is the possibility of data corruption.”
LeftHand has distributed quorum management that eliminates all possibilities of a “split brain”. This allows at least one site to operate and then automatically resync the other sites when they come back online. In a LeftHand SAN a volume instance spans both sites; it is not a copy of a volume. This means the applications see the same volume serial number and metadata that was on the primary site, because it’s the same volume. This makes application failover and failback seamless.
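The general idea behind quorum-based split-brain avoidance can be sketched in a few lines of Python. This is a generic strict-majority rule, not SAN/iQ's actual algorithm, which is not described here; the vote counts, the "tiebreaker" at a third location and the function names are all hypothetical. A side of a partitioned cluster may keep serving writes only if it can reach more than half of all votes, so two isolated sides can never both stay active.

```python
def has_quorum(reachable_votes, total_votes):
    """Strict majority: more than half of all votes must be reachable."""
    return 2 * reachable_votes > total_votes


def sides_allowed_to_serve(partition, total_votes):
    """Given votes reachable on each side of a network split, list who may serve I/O."""
    return [side for side, votes in partition.items() if has_quorum(votes, total_votes)]


# Two sites with two votes each plus a hypothetical tiebreaker vote elsewhere (5 total).
# The inter-site link fails; site A can still reach the tiebreaker.
with_tiebreaker = sides_allowed_to_serve(
    {"site_a+tiebreaker": 3, "site_b": 2}, total_votes=5
)

# Same split with no tiebreaker (4 votes total): neither side has a majority.
without_tiebreaker = sides_allowed_to_serve(
    {"site_a": 2, "site_b": 2}, total_votes=4
)
```

With the odd vote, exactly one side continues; without it, both sides stop, which avoids split brain at the cost of availability.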
Since NetApp uses a copy of a volume instead of the same volume they document the following:
“iSCSI and Fibre Channel LUNS may need to be rescanned by the application if the application (i.e., VMware®) relies on the LUN serial number. When a new FSID is assigned, the LUN serial number changes.”
I won’t get into automatic failback, because you already admitted NetApp can’t do that, plus it’s a mess and would take another page on this blog to describe it.
>>It's worth pointing out before I analyse this claim that I originally thought that LeftHand had come up with a new paradigm with its network RAID; that it provided both data protection and high availability built on commodity tin. I was wrong; if it was that easy, we'd have done it. But NetApp's 15 years' experience in doing this stuff has taught us otherwise.
It’s this big company ego that allows storage startups to do what they do best, think outside the box.
I can’t get into details about how LeftHand’s software works, but I can say the following:
1. HP RAID controllers are among the most technologically advanced controllers in the market, and they’ve been developing RAID controllers for longer than NetApp’s been in existence. The RAID controllers contain battery backed cache and do not release the cache until the data is on disk. We are not exposed to “lost writes.”
2. We do background scrubbing and recreate or reallocate bad disk sectors off of parity or Network RAID.
3. We can have any number of controllers reading and writing to any number of volumes. For example, a 4 node system has 4 controllers performing I/O, and with Network RAID up to 2 controllers can fail without the SAN going down. A 6 node system can support 3 controller failures, etc.
>>LeftHand Data Protection and HA
>>LeftHand systems are built from commodity servers and use a battery backed RAID controller to provide a log of writes to disk. This means that when a single node (non-HA) fails, the data in the cache is replayed to the disks when the system comes back up.
This is a completely false statement.
>>But what about lost writes? Do LeftHand SANs provide protection against failure to write data correctly from the disk's cache?
Yes
>>In a cluster, and using nRAID2, the IO is copied to a second node, and the same scenario as in the single node case is played out. Effectively, cache coherency is provided by mirroring the data to a second (or third, or fourth) node across the network, which is slower and adds to latency.
There is a slight write penalty, but reads are much faster, which means there is very little performance difference when running Network RAID with a typical 60/40 read/write workload.
>>With nRAID2, when a node fails, you want to guarantee your writes because the other node(s) are now holding a single copy of your data. That nRAID2 is now no more than RAID5 striped across one or more nodes, writing data through disk cache.
True. Just like NetApp has a single point of failure after a controller fails.
>>And every read is still fraught with danger. You may read a block from node A but get completely different data from node B for the same block request -- because there's no guarantee of protection against lost writes on any of the nodes.
This is obviously not true.
>>At which point, you now have a RAID5 solution with no lost write protection (or you turn on write-through on every disk and suffer huge IO penalties). Really, there's no point. Might as well buy a cheap Linux server and be done with it.
Like I said before, we are not exposed to lost writes.
Look under the hood of a NetApp controller head and you will see the same x86 chipset that is in a LeftHand node.
BTW, we don’t have performance degradation over time like NetApp because we always allocate blocks in sequential “strides”, and re-layout data in the background if required to keep I/O as sequential as possible. There is little to no performance degradation as the LeftHand SAN fills up.
>>No, let's focus first on my claim that LHN means more space, more power, more cooling, more cost per usable TB for an inferior solution. Address that, and then you get bragging rights.
Ok:
If you take the power used in a NetApp controller head and amortize it over the disk shelves, and add it to the power consumption of the disk shelves, NetApp uses more power than LeftHand. LeftHand also uses SAS drives instead of Fibre Channel drives, which consume less power. Other energy savings features incorporated into server architectures help reduce power even further.
LeftHand uses less space because we don’t have the controller head and our 2U 12 drive disk shelf is 6 drives per U, and NetApp’s 14 drive shelf is only 4.66 drives per U.
Our storage utilization with best practice configurations is comparable to NetApp’s in a non-HA configuration and better in HA configurations. When using Parity Based Network RAID it is better than NetApp in most configurations.
Looks like I have bragging rights.
Posted by: John Spiers | July 01, 2009 at 03:52 PM
John,
1. The RAID controllers contain battery backed cache and do not release the cache until the data is on disk. We are not exposed to “lost writes.”
2. We do background scrubbing and recreate or reallocate bad disk sectors off of parity or Network RAID
I'm disappointed, as I would have expected that an engineer with your apparent level of storage experience would understand that lost writes occur even when the disk has acknowledged the write back to the RAID controller. This can be summarised as "disks occasionally lie". It doesn't matter whether your HP RAID controller keeps the write until it's acknowledged or not; if the disk sends back inaccurate information, you're screwed.
2. A traditional RAID scrub won't pick up this kind of subtle corruption without some kind of semantic knowledge of what data is "meant" to be there. As far as I can tell, you don't have that capability.
3. Even if your scrub does pick this up, what happens if you serve up a block that is subject to a lost write before the RAID scrub gets around to fixing it?
I submit that, despite your protestations, you are indeed subject to data corruption via "lost writes".
Secondly, while I think you've done a good job with the quorums etc., I still think that adding a completely new set of spindles for a basic level of fault resiliency expected by every customer in the industry is wasteful. Even if you give the stuff away, it certainly isn't "green".
Posted by: John Martin | July 03, 2009 at 12:32 AM
Mathieu
This is the paper you want to read. (You'll need GhostScript to read this PS file.) Robin Harris summarised some of the findings in his blog that I pointed to.
Also note the difference between media and data scrubbing. NetApp does data scrubbing; it looks like LeftHand does only once-a-month media scrubbing, something that's quite inferior in its ability to detect latent errors. NetApp goes to exceptional lengths to protect your data.
Posted by: Alex McDonald | July 05, 2009 at 02:00 PM
John Spiers;
If I could find any best practices documents for LeftHand, I might be tempted to cut and paste too. Point me at some, I can't find them.
I'm afraid your whole reply demonstrates confusion between high availability (which you have) and adequate data protection (which you don't have).
John Martin's disappointment is mine too; I expected better from an engineer with your background on the subject of lost writes. FC drives have checksums at the hardware level for that very reason -- data reliability -- and NetApp ensures that non-FC drives like SATA get the same rigorous checksum protection in software against lost writes. You seem to think they can just be ignored, because your assertion that you don't have the problem is very far from the truth.
And nRAID still makes a nonsense of your colleague Chris McCall's claim
What Chris meant was more space, more power, more cooling, more cost per usable TB for an inferior solution. You do the sums at 34% utilization, and you'll see what I mean.
Posted by: Alex McDonald | July 05, 2009 at 04:11 PM
I posted this on a HP blog last week, but thought it also fits here.
NetApp talks about having 99.999% uptime and providing HA. I came across an Avanade PDF on NetApp's website (media.netapp.com/.../Avanade_Testing_Center_NetApp_Whitepaper_Exchange.pdf) that discusses testing the HA functionality of the FAS3050c array. I was quite shocked to find out on page 6 that both hard and graceful failovers caused 2 minutes and 27 seconds of user downtime. Maybe I'm wrong, but I don't see how that downtime is acceptable unless you are using it strictly for NAS. I would hate to see what would happen to a database or VMware if that were to take place while they were running.
I am currently an HP LeftHand Networks customer and have been very satisfied with their product. I have unintentionally as well as intentionally caused HA (using nRAID2) to kick in by failing LH node(s) and have never had a problem. I have never seen any user downtime because of it, and we are running Exchange, SQL, File, VMware, Oracle, etc. It's been nice to be able to do upgrades in the middle of the day without having to worry about downtime. Unless NetApp has made some changes to improve the Filer failover time, there is really no comparison between HP LeftHand and NetApp. From personal experience, when it comes to High Availability, HP LeftHand provides a far superior product. Granted, I do not have a NetApp system to gain personal experience with, so I have to take the referenced PDF for what it is worth, especially since it is coming from NetApp's website.
Thanks,
Chris
Posted by: Chris Brown | October 13, 2009 at 08:29 PM
@Chris
Thanks for commenting.
With appropriate timeout settings on applications, there's no downtime, even though the failover may take time to complete.
There have been many updates to NetApp systems since that paper was published. In general, for failover situations the rule of thumb is, for Data ONTAP 7.2 – 120s; and for 7.3 and beyond – 60s.
Notice that these are worst case timings. On average the timings will be much better. And our customers don't have to take downtime to upgrade; just a controlled failover and giveback.
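The timeout point can be put as a one-line rule, sketched here in Python. The worst-case failover figures are the rule-of-thumb numbers above; the application timeout values (30s and 190s) and the function name are hypothetical, chosen only to show both outcomes. The model is deliberately crude: an I/O issued just as failover starts completes if, and only if, the application is willing to wait longer than the failover takes.

```python
# Worst-case failover times quoted above, in seconds.
WORST_CASE_FAILOVER_S = {"Data ONTAP 7.2": 120, "Data ONTAP 7.3+": 60}


def io_survives(failover_s, app_timeout_s):
    """True if an I/O stalled for the whole failover still completes before timing out."""
    return app_timeout_s > failover_s


# Hypothetical application I/O timeouts, in seconds.
results = {
    (release, timeout): io_survives(failover, timeout)
    for release, failover in WORST_CASE_FAILOVER_S.items()
    for timeout in (30, 190)
}

for (release, timeout), ok in results.items():
    outcome = "I/O resumes" if ok else "I/O error"
    print(f"{release}, app timeout {timeout}s: {outcome}")
```

An application left at a short timeout sees errors in either release; one configured to wait longer than the worst case just sees a pause.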
Secondly, the zero RTO/RPO failover scenarios you have tried with LHN equipment are achievable with NetApp systems; see our MetroCluster for true HA and DR robustness.
LHN don't have any certification or user-measurable uptime statistics. Ours are drawn from our AutoSupport system that reports, in detail, information about the performance and availability of NetApp systems, from which we collect uptime statistics. IDC have certified NetApp's methodology for uptime measurement which is currently in excess of 99.999% for a single (non-HA) system, and near 6 9s for HA systems.
That's the difference I see every day. We can back up our claims; can LHN?
Posted by: Alex McDonald | October 14, 2009 at 02:28 AM
@Chris
Just one last observation; are you using VMware's vMotion (you indicate you're a VMware user)?
If you are, then motion times of minutes are usual and nobody considers that downtime, since the applications are still up. Applications that can't tolerate this kind of failover scenario aren't much use in a virtualised environment either. Most can, especially those you mention: Exchange, SQL Server, Oracle, etc.
Posted by: Alex McDonald | October 14, 2009 at 03:22 AM
@Alex,
Thanks for your replies and clarification. It's good to know that failover time is now in the 60s. That sounds a bit more reasonable. Is there a difference between controlled and uncontrolled failovers?
I am glad to see that you compare LHN zero RTO/RPO failover with NetApp using MetroCluster. When comparing HA/DR functionality between the two, I would have to say that NetApp would have to run MetroCluster in order to really compare the two. Even though LHN (with at least nRAID2) and NetApp with MetroCluster have similar HA functionality, there are still a couple of key differences between the two that I can think of off the top of my head.
Please correct me if I am wrong with any of these.
1. NetApp does not load balance between the filer and storage in each site (up to 2 sites). One location is the active and primary location. - LHN load balances between all controllers/storage in all site(s) (up to 4 sites).
2. When NetApp fails over, VMware has to do a rescan because the volume signature changes. Wouldn't this cause downtime? This does not happen with LHN.
3. With LHN, you can choose which volumes have HA enabled. This can be done on the fly: turn it on or off, with your choice of no replication, nRAID(2), nRAID(3), or nRAID(4).
4. In the event of a site failure/recovery, LHN automatically syncs all changes back to the site without any user intervention which resumes HA protection level and controllers/storage become instantly available. NetApp requires administrator intervention to gracefully fail it back.
5. Last but not least, the MetroCluster solution from NetApp costs about twice as much, if not more. I know this because we recently did a comparison between the two and chose LHN over NetApp.
In regard to your comment about VMware, I assume you meant VMotion instead of motion? I am not sure what you mean when you say it takes minutes to VMotion a virtual machine. I have not seen that; I've seen it take up to 10 sec of paused/downtime to VMotion. Are the minutes you mentioned due to the storage underneath? Also, when talking about the apps in VMware such as Exchange, SQLServer and Oracle, do you normally use vmdks for their data/log files? I use iSCSI within the VMs and I know that if I disconnect the network for long enough (30 sec.) it can cause the database to crash. So I am still skeptical of a database application staying up during 60s of storage failover/downtime.
Those are my thoughts. Take them for what they are worth and let me know what you think.
Posted by: Chris Brown | October 14, 2009 at 04:05 PM