Chuck Hollis wrote, in reply to Vaughn, with what is the essence of the problem with being on the wrong end of a disruption:
Posted at 12:27 PM in chuck hollis, Technology Trends, Traditional Legacy Array, Vmware, WAFL | Permalink | Comments (2)
When I first described my new terminology for storage tiers, Tim Burlowski asked how backup and recovery fit into that model.
It turns out that that is an interesting question.
Backup, the only working ILM/HSM solution
Step 10 feet away, and you realize that backup is HSM.
The backup process creates cold data, the backup image. The backup software then moves the backup copy to a cheaper tier of storage. When a restore is performed, the backup software moves the backup copy from the cheaper tier to a more expensive tier.
The question on the table, then, is where does this form of HSM still make sense?
Captive IOPS
One obvious place where HSM as part of the backup process seems to make sense is the backup of data on a Captive IOPS tier. The tier is built using very expensive storage. Storing cold data there that will probably never be accessed seems odd and a waste of money.
But that's not necessarily true.
If the Captive tier is built using disk drives, and the application has a high IOPS data density, a lot of the capacity of the disks is unused. The cheapest place to store backup copies, in that case, would be on the same disks you're using for primary data.
If the Captive tier is built using SSDs, as with EMC storage, then it makes a lot of sense to use a backup solution to move backup images off of the flash onto cheaper storage. If the application has a low IOPS data density, then a better approach is to use flash cache in front of disk.
If the Captive tier is built using a flash cache in front of disk, the cheapest place to store the backup copies may be on the disk drives.
More generally, the backup methodology of a captive tier made up of flash cache and disk will be the same as the backup methodology of the capacity tier.
So what about the capacity tier and backup?
In the general case, the capacity tier will be backed up much in the same way that it always has been.
A capacity tier that has real snapshots can eliminate the HSM part of the backup process.
In fact, in an earlier series on backup I proved that for a VMware infrastructure, the cheapest place to store backup copies was on the same disks that stored the active data.
Now I want to generalize that comment for all storage.
If you consider IOPS density, disk drives must contain large amounts of cold data. A subset of the cold data in a data center is the backup data. That data can either be stored on a separate set of disks or on the same disks that are being used to serve IOPS.
To store the data on the same disks the following things need to be true
It turns out that for NetApp systems all five are true.
So as disk capacities expand, storage architects who have the option to use real snapshots will stop moving data off of the capacity tier to some cheaper tier.
Said in a slightly more poetic way:
The final resting place for data on disk will be the disk where the data first got created.
The net effect will be a much simpler and more cost effective storage infrastructure to support backup.
One minor addendum, added after I wrote this post.
As Martin G says in the comments below, having the backup copies on the local disks doesn't protect you from site failures. So, of course, some form of storage replication is required. Thankfully NetApp has a cool technology called SnapMirror that replicates all of the snapshots to a remote destination in a space efficient way.
In an earlier post on VMware and backup I show how all of this fits together.
I should have included this need to replicate the data in my original blog post. So I am rectifying that error now.
Some time ago, I made the assertion that VMware and Dedup fundamentally redefined backup. The story made intuitive sense, but I was queasy because of the lack of really good analysis to back my intuition.
In my long series of posts on the arbitrage of backup, I finally was able to provide a model that precisely explained my assertion.
I created a model titled Backup Architecture Model that said:
Let
N = number of copies
Let
M = total spend
Let
SecondaryInfrastructureCost = cost of secondary infrastructure
Then
N = PrimaryCopies + SecondaryCopies
M = PrimaryCost*PrimaryCopies + SecondaryCost*SecondaryCopies
Now if
PrimaryCost * SecondaryCopies >> SecondaryInfrastructureCost
Then, we have a secondary infrastructure, and traditional backup.
But, if you read the long series, you'll remember that
PrimaryCost = Size(D) * CostPerByte + RTO*DownTimeCostPerMinute
What dedup does is reduce the value of Size(D) which reduces the value of PrimaryCost.
Which means that for a sufficiently high value of deduplication we can have
PrimaryCost * SecondaryCopies < or ~= SecondaryInfrastructureCost
Which means that the secondary infrastructure is unnecessary.
And it turns out we have such high values of deduplication in a VMware environment.
QED: Dedup redefines backup for VMware.
Which means the picture I drew,

isn't just wishful thinking by a NetApp engineer, but sound, rational decision making for a backup architect.
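As a rough illustration of the inequality above, here is a minimal Python sketch. The function names, the sample numbers, and the reading of ">>" as "at least `margin` times larger" are all assumptions made for the example; they are not part of the original model.

```python
def primary_cost(size_bytes, cost_per_byte, rto_minutes, downtime_cost_per_minute):
    """PrimaryCost = Size(D) * CostPerByte + RTO * DownTimeCostPerMinute."""
    return size_bytes * cost_per_byte + rto_minutes * downtime_cost_per_minute


def secondary_infrastructure_justified(primary_cost_value, secondary_copies,
                                       secondary_infrastructure_cost, margin=10.0):
    """True when PrimaryCost * SecondaryCopies >> SecondaryInfrastructureCost.

    ">>" is interpreted (an assumption) as "more than `margin` times larger".
    """
    return primary_cost_value * secondary_copies > margin * secondary_infrastructure_cost


# Hypothetical numbers: 10 TB of data, $1 per GB, 10 minute RTO, $100 per minute of downtime.
SIZE_BYTES = 10e12
COST_PER_BYTE = 1e-9
DEDUP_RATIO = 20  # hypothetical 20:1 reduction of Size(D)

without_dedup = primary_cost(SIZE_BYTES, COST_PER_BYTE, 10, 100)
with_dedup = primary_cost(SIZE_BYTES / DEDUP_RATIO, COST_PER_BYTE, 10, 100)

# 30 secondary copies, against a hypothetical $20,000 secondary infrastructure.
before = secondary_infrastructure_justified(without_dedup, 30, 20_000)
after = secondary_infrastructure_justified(with_dedup, 30, 20_000)
print(before, after)
```

With these made-up inputs the secondary infrastructure is justified before deduplication and not after, which is the shape of the argument; the real threshold depends entirely on the actual costs.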
Posted at 11:09 AM in Backup, Data Center Operations Theory, Vmware | Permalink | Comments (0)
I promise to get to what SMVI does but I need to explain one more thing... For those who understand the term Virtual Machine Consistent Backup, just skip this post.
Two forms of Backup
There are two forms of backup, crash consistent and consistent.
A crash consistent backup is the moral equivalent of pulling the plug on a server and then backing up the data. The state of the data that is being backed up with respect to the users of the data is indeterminate.
A consistent backup is the moral equivalent of first performing a clean shutdown of the application, then the server and then performing the backup.
Of course, in practice, a consistent backup can be performed without shutting down an application or server by putting the application and server into hot backup mode while the backup takes place.
What's the difference?
A crash consistent backup may result in:
A consistent backup will
So, given that consistency is better than crash consistency, the challenge is how to make consistent backups as easy to perform as crash consistent backups, and that's where SMVI fits in ...
What is a Virtual Machine Consistent Backup?
A virtual machine consistent backup is the equivalent of performing a server consistent backup. By that we mean that any in-flight I/Os are committed to stable storage before the backup is performed.
In the case of VMware, that is achieved by using VMware snapshots.
For Hyper-V you would use the VSS framework.
Posted at 11:40 AM in Backup, Server Virtualization, SnapManager, Vmware | Permalink | Comments (0)
In part I of my series on SMVI, I described the backup challenge caused by the need to move more data in the same amount of time using fewer resources.
In part II of my series on SMVI, I described the disaster recovery challenge.
In part IV of my series on SMVI and the data protection challenges of a VMware environment, I want to transition to the value of the NetApp Snapshot.
So what are the requirements for a great backup technology:
But why are these all necessary?
Guess what, a NetApp Snapshot satisfies all of these requirements, and a NetApp Snapshot is the only technology that satisfies all of these requirements. Some technologies satisfy some of these requirements, but only a NetApp Snapshot satisfies them all.
So how do Snapshots work? There is an astounding amount of EMC FUD on the net that describes how a NetApp Snapshot works, and not as much NetApp technical documentation. At the core, the story from Hitz's paper remains the same (although there have been about 15,000 man years of engineering done since his paper). We are able to create a snapshot in constant time because we have a map of the blocks that are allocated on disk. A snapshot is really just a copy of the block map rather than the actual disk blocks.
Because a snapshot is a point-in-time copy of a Flexible Volume, a Snapshot can be used to create a point-in-time copy of any kind of storage object within the Flexible Volume: LUNs or Files. For a good technical paper on Flexible Volumes I recommend the recently published Usenix paper.
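The "copy of the block map" idea can be sketched with a toy model in Python. Everything here (the `ToyVolume` class, its fields and methods) is hypothetical and only illustrates the principle: new writes go to new blocks, and a snapshot duplicates pointers, never data. It is not how WAFL is implemented; in particular, copying a Python dict takes time proportional to the number of map entries, so the toy does not reproduce the constant-time property described above.

```python
class ToyVolume:
    """Toy volume: a snapshot is a frozen copy of the block map, not of the data."""

    def __init__(self):
        self.blocks = {}      # physical block id -> data
        self.block_map = {}   # logical block number -> physical block id
        self.snapshots = {}   # snapshot name -> frozen copy of the block map
        self._next_id = 0

    def write(self, logical_block, data):
        # Never overwrite in place: new data always lands in a new physical block,
        # so blocks referenced by older snapshots stay intact.
        physical_id = self._next_id
        self._next_id += 1
        self.blocks[physical_id] = data
        self.block_map[logical_block] = physical_id

    def snapshot(self, name):
        # Copies only the pointers; no data block is duplicated.
        self.snapshots[name] = dict(self.block_map)

    def read(self, logical_block, snapshot=None):
        mapping = self.block_map if snapshot is None else self.snapshots[snapshot]
        return self.blocks[mapping[logical_block]]


vol = ToyVolume()
vol.write(0, "version 1")
vol.snapshot("hourly.0")
vol.write(0, "version 2")
print(vol.read(0), "|", vol.read(0, snapshot="hourly.0"))
```

The active view sees the new data while the snapshot still resolves to the old block, and the only extra space consumed is the one block that changed.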
So let's walk through each one of those data protection requirements and see how NetApp storage can satisfy them:
In my next post in this series I'll explain how SMVI uses NetApp Snapshot to create VMware consistent backups.
Posted at 08:43 AM in Backup, data management, Server Virtualization, SnapManager, Vmware | Permalink | Comments (0)
Before I can get into how SMVI actually works, I want to remind folks about how VMware storage and NetApp work together. For folks already familiar with VMware and NetApp you can wait for my next post where I describe how NetApp Snapshots eliminate the backup problem.
VMware Infrastructure and NetApp Storage
Consider this picture of a physical server connected to a NetApp storage device.
Starting from the right moving left let's build up to the final picture.
The rightmost layer shows a collection of virtual machines running some guest OS. Each virtual machine has a set of CPU, memory and storage resources. To the guest OS running in the virtual machine, the storage resources appear as if they were one or more locally attached SCSI disk drives.
Moving a little bit more to the left, we observe that the disks are really a kind of file called a vmdk contained within a datastore. A datastore itself is a virtualized storage pool.
If the ESX server is connected to the array using FC or iSCSI, a filesystem must be created within the datastore. VMware currently only supports VMFS. In that case the datastore is called, unsurprisingly, a VMFS datastore.
If the ESX server is connected to the array using NFS, the datastore is just an accounting structure. The file system within the datastore is implemented by the array. The datastore in this case is called an NFS datastore.
Finally, all the way to the left, we have the NetApp storage objects. In the case of FC, the array is configured with a LUN within a FlexVol that is contained within an aggregate. In the case of NFS, the datastore is contained within a Flexible Volume which is contained within an aggregate.
Okay so that's nice, but why was that important?
Now remember, the virtual machine has a bunch of disks that are mapped through multiple layers of virtualization to a bunch of physical blocks. SMVI is used to perform backups and recovery of those virtual machine disks using snapshots. In a NetApp infrastructure, you can only take a snapshot of a Flexible Volume. So to take a snapshot of a virtual disk you have to understand the mapping between the virtual disk and the containing Flexible Volume.
Whew.
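The layering described above can be written down as a small data structure. This is only an illustrative Python sketch: the class names, fields, and the `flexvols_to_snapshot` helper are hypothetical and are not SMVI's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlexVol:
    name: str
    aggregate: str   # the aggregate that contains this Flexible Volume


@dataclass(frozen=True)
class Datastore:
    name: str
    kind: str        # "VMFS" (FC/iSCSI, a LUN inside the FlexVol) or "NFS"
    flexvol: FlexVol


@dataclass(frozen=True)
class VirtualDisk:
    vm: str
    vmdk: str
    datastore: Datastore


def flexvols_to_snapshot(disks):
    """Return the distinct FlexVol names that must be snapshotted to cover `disks`."""
    return sorted({disk.datastore.flexvol.name for disk in disks})


# Hypothetical layout: two datastores, each backed by its own FlexVol.
vol_a = FlexVol("vol_vmfs01", "aggr1")
vol_b = FlexVol("vol_nfs01", "aggr1")
ds_vmfs = Datastore("datastore1", "VMFS", vol_a)
ds_nfs = Datastore("datastore2", "NFS", vol_b)

disks = [
    VirtualDisk("web01", "web01.vmdk", ds_vmfs),
    VirtualDisk("web01", "web01_data.vmdk", ds_nfs),
    VirtualDisk("db01", "db01.vmdk", ds_nfs),
]
print(flexvols_to_snapshot([d for d in disks if d.vm == "web01"]))
```

Because the snapshot unit is the Flexible Volume, backing up one virtual machine means resolving each of its disks down to a volume first, which is the point of walking through all the layers.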
In my next post, I'll explain how NetApp Snapshots address the backup challenge.
Posted at 08:09 AM in Server Virtualization, Vmware | Permalink | Comments (0)
Continuing with my series on SnapManager for Virtual Infrastructure, today I want to talk about disaster recovery (DR).
So what is a disaster?
Typically some part, if not all, of a data center is damaged. Things like hurricanes, tornadoes or, more prosaically, leaking pipes damage a significant chunk of the infrastructure. As a result whole systems are lost and have to be brought back online in some other location.
So what is recovery?
For storage it means bringing the data that was lost back online.
One obvious answer to DR is to move the backup data to the remote site via truck, and then to use the backup copy in the case of a disaster.
Okay... but I just argued that doing full backups doesn't scale, and the alternative to full backups, incremental backups, may make things worse. And even if incremental backups don't make things worse, the problem of moving the data from the backup target remains. Moving terabytes from a storage device or tape to a storage device takes time...
And so ultimately if you have any kind of reasonable recovery time objective, recovery from a separate backup infrastructure is not an option.
What to do?
Well you could have another storage system on standby at your remote location with a replicated copy of the data.
But why not have all your backups on the standby storage system as well?
Well, because unless you happen to be using a NetApp system it's simply not cost effective and of course, for EMC storage, there is that annoying performance penalty.
So in any real data protection infrastructure where you need to tolerate both local and site disasters you end up with two dedicated infrastructures in any VMware environment:
The backup infrastructure I described here, and a new separate DR infrastructure. The DR infrastructure is used to replicate the storage from the local to the remote site.
Of course, the remote site has its own distinct backup infrastructure so the real picture is:
And of course there is the truck...
Okay, so what is the answer? SnapMirror to replicate the data, space efficient snapshots and deduplication to eliminate the cost of the storage, oh and by the way, a lot of hardware to throw out...
Posted at 01:18 PM in Backup, Data Center Operations Theory, Vmware | Permalink | Comments (1)
NetApp today shipped SnapManager for Virtual Infrastructure (SMVI), a product that when combined with real snapshots, NetApp's deduplication technology, primary storage capabilities and SnapMirror technology, fundamentally re-writes the rules of what backup infrastructures should be.
To give you a flavor of what I am talking about...
A traditional VMware backup infrastructure looks like this:
The problems with this infrastructure are obvious.
The NetApp solution on the other hand looks like this:
Once you've deployed your ESX server on NetApp storage, all you have to do is install SMVI on the same machine or a separate machine from Virtual Center and you are done.
No separate servers, no separate software, and no separate infrastructure. The backup images are stored on the storage system itself.
There is no data copy when you create a backup, and there is a minimal data copy on a restore. That means faster and cheaper backups.
And oh, by the way, the DR solution is pretty simple too...
And finally, unlike the folks at EMC, you can use our snapshots without worrying about performance...
Posted at 01:49 PM in data management, emc, SnapManager, Vmware | Permalink | Comments (8)
One of the more interesting conversations I've had recently was how NetApp end-to-end deduplication and our replication technology fundamentally changes how you think about backup in a virtualized environment.
The core challenge in a VMware or Hyper-V environment is that the restore is extraordinarily painful because of the amount of data that needs to be moved. You are no longer just restoring a file, but an entire image including OS, application image, and data.
And the single biggest bottleneck to a restore is the data transfer from the repository to wherever the primary is connected.
The typical infrastructure has data flowing from some primary system, through a backup server to some backup media. The notion being that primary storage is expensive, and the backup target is much cheaper, making the restore time acceptable from an overall cost perspective.
If data transfer is so painful, then why transfer the data? Why not keep it around on the local system?
Well there are three reasons:
It turns out that NetApp Snapshot and our Deduplication fundamentally change the cost equation because of how they work together. NetApp Snapshot technology only keeps a point-in-time copy of the blocks that have changed inside of a volume. NetApp Deduplication eliminates any duplicate data between different storage containers within a volume.
And of course, unlike other snapshot technologies, a NetApp Snapshot is performance neutral so using them is not a space-time tradeoff.
And it turns out that deduplication has a small performance impact as well.
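To make the deduplication idea described above concrete, here is a toy Python sketch of block-level deduplication across several containers in one volume. Using a SHA-256 content hash as the block fingerprint, the tiny block size, and the `dedup_store` function itself are assumptions chosen for illustration; this does not describe NetApp's actual implementation.

```python
import hashlib


def dedup_store(containers, block_size=4):
    """Store each distinct block once; containers keep only references.

    containers: dict mapping a container name to its bytes.
    Returns (store, layout): fingerprint -> block, and name -> list of fingerprints.
    """
    store = {}
    layout = {}
    for name, data in containers.items():
        refs = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            fingerprint = hashlib.sha256(block).hexdigest()
            store.setdefault(fingerprint, block)  # keep only the first copy
            refs.append(fingerprint)
        layout[name] = refs
    return store, layout


# Two hypothetical virtual disks sharing the same "OS" block but different data.
containers = {
    "vm1.vmdk": b"OSOSAAAA",
    "vm2.vmdk": b"OSOSBBBB",
}
store, layout = dedup_store(containers)
logical_blocks = sum(len(refs) for refs in layout.values())
print(logical_blocks, "logical blocks stored as", len(store), "physical blocks")
```

Virtual machine images built from the same OS share most of their blocks, which is why the savings in a VMware environment can be so large.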
What does this mean? Well, it means that the number of copies a storage architect can keep on the primary is significantly higher than with other storage arrays. More snapshots mean you are less likely to go to a remote machine to restore data, and that translates into faster recovery times!
I can see someone pointing out that the primary storage system will still be more expensive even if snapshots and deduplication are used. But here's the rub: the secondary system costs money, and the restore costs money in terms of lost productivity during the restore. To make this worthwhile the primary system has to be cheaper than the cost of the secondary system AND the restore cost. This is a much lower bar to jump over.
But what about a disaster?
Enter SnapMirror, our volume mirroring technology. It turns out that SnapMirror allows you to replicate an entire volume and all of its snapshots. It also turns out that SnapMirror will only replicate the deduplicated data.
What this means is that the cost of the network and the secondary system is lower than with alternative technologies, because the amount of data is less and the amount of computation is less (no deduplication on the secondary).
It also means that the recovery after a disaster is faster, because the data is online and can be directly accessed rather than stored on Tape or a VTL and then restored to a primary storage system.
But that's not all ...
Because you are keeping the backup data on the primary storage system and are not moving the data other than to replicate for DR, this also simplifies the backup infrastructure. There are no complex highly available backup servers that need to be set up and managed.
And because you are keeping all of the backup data with the primary copy, there is no painful need to index and track data on remote devices.
So in conclusion
You get faster restores, faster recovery after a disaster, and less complexity because there is no backup process and backup system to manage, all because the cost of doing this got cheaper...
Pretty cool...
Posted at 11:03 PM in Backup, data management, Deduplication, Replication, Vmware | Permalink | Comments (0)
Dave, in his excellent post on server virtualization, asked the meta-question: What is really going on?
So here's my thinking.
Moore's law has produced ever increasing numbers of transistors and ever increasing clock frequencies that software developers have been able to exploit by doing nothing. In fact, the increasing number of transistors and increasing clock frequency made it possible to add even more value in software faster.
From a consumer of software perspective, the value of software increased over time even though the cost stayed constant and in fact declined in real-dollars over time. It was the speed with which transistor counts doubled and clock frequencies increased and the ease with which they were both exploited that created the high-tech industry.
Traditionally software developers would experience a speed up in software just by moving to the new hardware, leaving more cycles to add more value. However, in this multi-core world, the only way to leverage the increasing transistor count to get performance increases is to write Multi-Threaded (MT) code that is Multi-Processor (MP) safe.
It turns out that writing MT and MP safe code is harder than writing code that runs on a single core processor. It also turns out that whereas in the past performance improvements were transparent to the pre-existing software, increasing performance with multiple CPU cores requires changing pre-existing software.
The net effect was that software was unable to leverage the new transistors as easily as in the past.
So cores went idle. And it became harder to add more value to software. The free compute cycles of yesteryear evaporated.
And then, all of a sudden, data centers had these idle compute resources.
And some folks assumed that these were real idle CPU cycles. They, incorrectly, assumed that even if using the cycles was easy they would remain idle.
What really happened was that because traditional software could not transparently leverage the increased transistor count, the cost of a small increment in *usable* compute cycles increased. In plain English, to run any new application you needed a new x86 server which for the most part was idle. So let's say you needed 1% of a CPU: you had to buy 100% of a CPU, of which 99% was idle. But the fixed cost in racking, stacking, cooling and operating the processor was the same regardless of whether the CPU was 99% idle or 99% utilized. If the cost of a fully loaded server was $15,000 (server+racking+cabling+switches+routers), then customers were spending an extra $14,850 to get what should have cost only $150 (1% of $15,000).
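The arithmetic in that example is easy to check with a few lines of Python. The function names are hypothetical, and the consolidation figure assumes, purely for illustration, that the applications can share one server with no overhead.

```python
def idle_spend(server_cost, utilization_percent):
    """Split a server's fully loaded cost into the used part and the idle part."""
    used = server_cost * utilization_percent / 100
    return used, server_cost - used


def cost_per_app(server_cost, apps_per_server):
    """Cost per application when several applications share one server."""
    return server_cost / apps_per_server


used, idle = idle_spend(15_000, 1)
print(f"used: ${used:,.0f}, idle: ${idle:,.0f}")

# Idealized consolidation: 100 applications, each needing 1% of a CPU, on one server.
print(f"per app when shared: ${cost_per_app(15_000, 100):,.0f}")
```

One application per server pays for the whole $15,000; idealized sharing brings the per-application cost back to the $150 it "should" have cost.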
In effect we had broken the virtuous side effect of Moore's law: that increasing transistor count would result in more available compute cycles for applications at no incremental effort on the part of the software developer and as a result software would increase in value as developers added more features and functions with no increase in cost to the customer.
Worse, because of the economics of power and cooling we were staring at the uncomfortable realization that the cost of software (hardware+software) may be increasing rather than staying constant or falling.
Think about it, we had software that was not improving coupled with hardware costs that were going up, hardly a recipe for long term health.
Enter server virtualization which enabled the use of the cores without requiring MP safe code. And some folks got excited with server virtualization because they assumed that this was a sure fire way to save money. The theory was that we could now run all of those applications on those idle processors and that the total number of processors would decline permanently and net costs would go down. And the assumption was that this was a permanent cut in total computing cycles.
And the assumption underlying this was that we had dramatically over-provisioned our compute cycles, because these were truly idle cycles. In other words, even if the cycles were easy to use, we would not use them because they were unnecessary.
And I must respectfully disagree with that notion. If we really believed that the CPU cycles were really free then software innovation would have already stopped.
The real implication is that server virtualization made it possible to easily use those idle cycles. And once you made it easy, software developers and customers would start consuming them.
Because virtualization made it cheap again to get incremental usable CPU cycles, I believe the net effect is to increase the overall demand. This compression in servers is a temporary phenomenon. The demand for ever increasing compute cycles will continue.
Posted at 07:51 AM in Data Center Operations Theory, Server Virtualization, Technology Trends, Vmware | Permalink | Comments (0)
The views expressed on this website/weblog are those of the author alone and are not deemed to be approved or endorsed by NetApp.

The reality of a disruptive technology is that it provides enough value at the right price point. Not the whole value, but enough of the value. And yes, the price point for VDI is the damn $50 disk drive at Best Buy (or Fry's for those of us in California ...).
If you can't deliver at that price point you're making a TCO argument: that over time the customer will see real savings. Sort of how Sun competed with Linux for years, while customers kept saying, "But I want to pay less."
EMC claims to have figured this out from the early part of this decade: that either you deliver the price and value or you might as well not show up ...
Part of the point on my whole series on the DAS disruption is that you have got to be able to hit the price point. TCO is funny money. Real dollars are what people care about.
The fact that NetApp is able to compete with the cheapest alternative and deliver compelling value is nothing short of amazing. This may, just may, represent the tipping point for VDI in the enterprise.