Hi all,
today I will show you a particularly simple design for continuous operation, including disaster recovery. The design is based on two standard products, VMware HA and NetApp MetroCluster, and does not require scripting; nearly every failover scenario is handled automatically. The design is for companies with a fast network connection between the production and the disaster recovery (DR) datacenter. In the best case, the network speed between the sites is the same as inside the datacenter.
The main question for DR is: is there a way of doing DR without complex storage and server installations, which require a lot of installation services and ongoing maintenance? Yes, there is, with VMware HA and NetApp MetroCluster. In my last blog I talked about VMware HA, which fails over a VM automatically and transparently from one ESX server to another ESX server in case of a failure. Now we split this configuration across two datacenters. The picture shows two datacenters, a primary and a secondary site. We build the VMware ESX HA server cluster across both datacenters, and we do the same for the NetApp storage with the MetroCluster option. In addition to transparent storage site failover, MetroCluster also mirrors the data between the two datacenters. In the picture, the user is connected to the active VM1 on the production site, and the data for VM1 is on this site as well.
How does this work in the different scenarios?
a) Server Failure
As in the picture above, the user is connected to the active VM1 on the production site, and the data for VM1 is on this site as well. Now the server fails (see the picture on the left). VMware HA detects the failure and restarts VM1 on the server at the DR site. The user does not notice any real interruption: his work continues and he is still connected to VM1, which now runs on the DR site.
b) Failure at the Storage Layer
Something breaks at the storage layer, so MetroCluster transparently switches the storage service with the data of VM1 over to the DR site. The end user keeps working with his VM and application on the production site without any interruption. The data (the VMFS datastore) for VM1 is now served from the DR site.
c) Site or Datacenter Failure
A datacenter goes down because of a power outage, a fire, or some other issue. In this case VMware HA switches VM1 over to the DR datacenter, and MetroCluster switches the storage resources for VM1 over to the DR site. The end user keeps working and does not notice a real disruption; he is still connected to VM1, which now runs out of the DR site.
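To summarize the three scenarios, here is a minimal sketch in Python of which layer moves in which case. It is only a toy model of the design described above: the names `State` and `fail_over` are made up for illustration and are not part of any VMware or NetApp API, and the failure is always assumed to happen on the production site.

```python
from dataclasses import dataclass

# Hypothetical toy model of the two-site design. Nothing here calls
# VMware or NetApp software; it only records where VM1 and its data live.

@dataclass
class State:
    vm_site: str       # site whose ESX server runs VM1
    storage_site: str  # site whose storage controller serves VM1's datastore

def fail_over(state: State, failure: str) -> State:
    """Return where VM1 and its storage run after a failure on the production site."""
    if failure == "server":
        # Scenario a): VMware HA moves the VM, the storage stays where it is.
        return State(vm_site="DR", storage_site=state.storage_site)
    if failure == "storage":
        # Scenario b): MetroCluster moves the storage service, the VM stays.
        return State(vm_site=state.vm_site, storage_site="DR")
    if failure == "site":
        # Scenario c): both layers switch over to the DR site.
        return State(vm_site="DR", storage_site="DR")
    raise ValueError(f"unknown failure type: {failure}")

start = State(vm_site="Production", storage_site="Production")
for failure in ("server", "storage", "site"):
    print(failure, "->", fail_over(start, failure))
```

The point of the sketch is that the two layers fail over independently: only in the site failure case do both move at the same time.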
The solution design above gives you an RPO (recovery point objective) and RTO (recovery time objective) of 0. Again, the key is the network speed and the network design between the datacenters.
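To give a feeling for why the network between the sites matters so much, here is a rough back-of-the-envelope sketch in Python. It assumes the mirror is synchronous (which an RPO of 0 implies), so every write has to wait for at least one round trip to the other site. It only counts signal propagation in optical fiber, at roughly 200,000 km/s; switches, protocol overhead and the storage systems themselves add more on top. The function name and the `round_trips` parameter are my own illustration, not values from any product documentation.

```python
# Rough estimate only: light in optical fiber travels at about 200,000 km/s,
# i.e. roughly 5 microseconds per kilometer in one direction.
FIBER_KM_PER_SECOND = 200_000.0

def mirror_round_trip_ms(distance_km: float, round_trips: int = 1) -> float:
    """Propagation delay added to each mirrored write, in milliseconds.

    round_trips is an assumption: how many site-to-site round trips
    one write needs before it can be acknowledged.
    """
    one_way_seconds = distance_km / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * round_trips * 1000.0

for km in (1, 10, 100):
    print(f"{km:>4} km: {mirror_round_trip_ms(km):.3f} ms added per write")
```

The delay grows linearly with the distance, which is why the design works best when the two datacenters are close enough that the link behaves almost like a connection inside one datacenter.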
All the Best
Manfred

Isn't it the case that when one site completely loses power, you have to manually force a cluster takeover on the surviving MetroCluster node?
So in case of a complete site failure, the RTO will not be 0...
The data will still be intact, but there will be some issues with its availability, as a manual intervention is needed.
Posted by: Tom | September 09, 2009 at 04:10 AM
Hi Tom ,
great catch. I believe you are referring to a split-brain scenario, where all connections between the sites go down at exactly the same time. This is very rare if you have a design with different physical connections and network components between the sites. For this case, most customers have a control instance that initiates the takeover.
Best Regards
Manfred
Posted by: Manfred | September 09, 2009 at 05:30 AM