On Monday, I posted a blog entry called Tears of Recovery. In that post, I described how a former consulting client of mine contracted me to redesign their backup architecture. The three areas with the highest priority and criticality were:
- ClearCase
- MS SQL
- Exchange
Where we left off was at my final deliverable. I wrote an Operations Guide for the backup/recovery solution, trained the Network Operations Center personnel and the various admins who would be managing the backup infrastructure across the country, and offered several observations based on decisions they had made along the way. I felt strongly that if these areas were left unchecked, the client would face exposure and undue risk over time.
Just to recap - here are the three top priority "red flags" I highlighted.
- In order to save costs, they chose NOT to buy maintenance on the NEW library they purchased
- They chose NOT to implement the SLA that I suggested which included regular TEST recoveries of critical business application data
- They chose NOT to run the duplication scripts that I wrote for them to create secondary backup copies for offsite storage
Several months after I had completed this work, including the knowledge transfer, this client experienced a critical "system" failure. The failure was made worse by the number of failed or partial backup jobs. Instead of being able to restore from the previous night's backup, they had to go back further (1 week). However, when they attempted to restore that backup, it failed. I believe the error message they kept getting was rather ambiguous. The bottom line was they couldn't recover.
Timeline
Day 1: The storage used for Exchange failed. (For the record, this wasn’t NetApp storage but storage from a big box pusher vendor.)
When the failure occurred (about 2:00am on a Monday morning), the NOC personnel called the SysAdmin for Exchange and the recovery process began. As you would expect, there was an attempt to bring the storage back online, on the assumption that there was just a “glitch” somewhere. These attempts took several hours, as much of the work was done by the onsite NOC personnel and relayed back to the Admin via phone.
By 7:00am, the SysAdmin and his team were in the data center working on the next step to recovery, the tape backup. Searching for the most recent backup was fairly painless; the pain came when they noticed that the most recent backup was over a week old. As I mentioned above, failed or partially successful backup jobs had plagued them during the previous week.
What made this worse was that this Exchange server held the mailboxes of some very high-level managers and directors, so being down meant reduced communication.
When they attempted the restore from this backup, the job would start, the backup application would communicate with the tape library, requesting that a particular barcode be mounted in an available tape drive, and the restore would begin. The restore would get through the first tape and request the mount of the second tape, and that’s when the read error would occur.
RESTORE FAILED
Remember this red flag I pointed out earlier?
- They chose NOT to run the duplication scripts that I wrote for them to create secondary backup copies for offsite storage
Had they cloned or attempted to clone the backup tapes, this may have given them a clue that something was amiss with their solution, and they could have taken other actions to remedy the situation PRIOR to a CRITICAL FAILURE. Unfortunately they were well past the point of no return and were in REACTIVE mode.
They tried the restore again, with the same results. Believe it or not, this went on into Tuesday morning, when they pulled together a “tiger team” to determine what their next move was going to be.
On a whiteboard, one of the managers started writing down all of the vendors who had products in their Exchange environment.
1. Switch Vendor
2. Server Vendor
3. Storage Vendor
4. Software Vendors (including backup software, Microsoft naturally for Exchange and operating system vendors)
5. Tape Library Vendor
6. Consultant (uh, that would be me)
Then someone had the idea to call in each one of these vendors, tell them the situation and enlist their help to fix the problem. When they called my company they were told that I was out sick; I had been ill with a fever of 102.
I'll never forget when my cell phone rang and I heard my partner on the other end saying “they need you and are willing to pay whatever you want to come help them resolve the problem.”
Well, let me tell you there’s nothing like saying “willing to pay you whatever you want” to get someone out of bed with a fever. I loaded up on pain relievers and headed off to the site.
When I arrived I was amazed at how disorganized everything was. Basically, this is what all the vendors had been told before I arrived: “go and find the needle in the haystack”.
Firmware was being updated, patches were being applied, tape drives were being tested/replaced and I was gathering log information from the backup application. All of this was happening in parallel.
Remember…
- They chose NOT to implement the SLA that I suggested which included regular TEST recoveries of critical business application data
Had they documented a test plan, detailed recovery process and tested this process, the chances are extremely high that they would have uncovered the issues during this test.
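A test recovery is only useful if the restored data is compared against a known-good reference. As a minimal sketch (not the procedure used at this client), the check below hashes every file under an original directory and a restored directory and reports anything missing or different. The function names `sha256_of` and `compare_trees` are hypothetical, and the sketch assumes the original files are still available for comparison when the test restore runs.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_dir, restored_dir):
    """Compare a restored directory tree against its source.

    Returns (missing, mismatched): sorted lists of relative paths that are
    absent from the restore or whose contents differ from the source.
    """
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    missing, mismatched = [], []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        if not dst.is_file():
            missing.append(rel.as_posix())
        elif sha256_of(src) != sha256_of(dst):
            mismatched.append(rel.as_posix())
    return missing, mismatched
```

A scheduled test restore would call `compare_trees` on the live data and the restore target and raise an alert whenever either returned list is non-empty.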
About half a day into my analysis I uncovered what I believed to be the problem. I ran over to the tape library service person and asked if he had removed or replaced the drives. He had. I asked what problems he had found with the drives. He outlined them, but I focused on one in particular: one of the drives was slightly out of calibration and couldn’t be brought back within spec, so he had replaced it. I asked him if it was logical drive 10 and he responded in the affirmative.
The service technician told me if they had purchased the service contract all of this would have been covered.
- In order to save costs, they chose NOT to buy maintenance on the NEW library they purchased
What I had uncovered was that the tape cartridge that failed with the read error had originally been written using logical drive 10, while the drive it had been mounted in repeatedly during the restore attempts was logical drive 6. My belief (which I’ll never be able to confirm) is that drive 10 was just far enough out of alignment to make it impossible for any other drive to read what it had written, but not far enough to fail during the mount and write process. Unfortunately we’ll never know.
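The reasoning above suggests one check that regular test restores could include: read each tape back on a drive other than the one that wrote it. The sketch below is purely illustrative. It assumes you can record, for each read attempt, which drive wrote the tape, which drive read it, and whether the read succeeded; the function name `suspect_writers` and the record format are hypothetical and do not come from any backup product.

```python
from collections import defaultdict


def suspect_writers(read_attempts):
    """Flag writing drives whose tapes could not be read on any other drive.

    read_attempts is an iterable of (written_by, read_on, succeeded) tuples.
    Reads performed on the same drive that wrote the tape are ignored, since
    a misaligned drive may still be able to read its own output.
    """
    cross_reads = defaultdict(list)
    for written_by, read_on, succeeded in read_attempts:
        if read_on != written_by:
            cross_reads[written_by].append(succeeded)
    # A writer is suspect when at least one cross-drive read was tried
    # and none of them succeeded.
    return sorted(
        writer for writer, results in cross_reads.items() if not any(results)
    )
```

Fed a log in which every tape written by drive 10 fails on drive 6 while tapes written by drive 6 read fine elsewhere, this would single out drive 10 for inspection long before a real restore depended on it.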
Incidentally, after all the drives were replaced – the restore still failed in the EXACT same spot – which seemed to confirm my assumption.
As a last resort the client sent the backup tapes and array off to a data recovery service which could read that data, recovering it to some portable media which was eventually shipped back to the client.
Five days after the initial failure the client received the portable media they had supplied to the recovery service with the recovered data on it. Incidentally, these were all individual .pst files that had to be merged back into Exchange. Since it was late on a Friday, the SysAdmin copied the data to a storage array they had offline (because it was having problems and not production ready – as he told me) – after he confirmed all the .pst files were there – he put the portable media on the shelf to be ‘re-used’ by whoever needed it and went home.
Does this sound like a Looney Tunes cartoon? Isn’t this the part where Wile E. Coyote gets the anvil dropped on his head, followed by the crate?
Saturday morning rolls around; the SysAdmin comes in with his coffee in hand and begins the long process of bringing the Exchange server back online by merging all of the .pst files. However, overnight two drives had failed in this RAID-5 storage array. All of the data copied to this array the night before was LOST.
How do I know?
I got a phone call – “what do I do?” Well, the first thing I said was: restore from your backup. You have a pristine backup environment now; nearly everything is brand new.
More tears of recovery…alas, he never backed it up.
Luckily for him the portable media was still on the shelf and still had the data he needed…
Lessons learned
- Pre-planning saves time and personal effort
- Understand the impact to your business and invest accordingly
- Apply business continuity strategies according to the value of the data
- Test your plans
- Maintain/Update your plans
- Test again
- DON'T PANIC
I have presented and written about this subject extensively over the last 15 years - I'm still asked today, about the book I co-authored in 2003, if I will ever expand on the DR Planning and Business Impact Analysis Planning sections. Irrespective of the time that has passed since that book was published, the need still exists.
If you've been reading my blog you know that I have talked about NetApp's tiers of recovery - stay tuned for "Tears into Tiers" where I'll take this customer's environment and show what the customer experience would have been by taking advantage of the NetApp technology available today.
Chapa signing off...
PS. The cost for all of this? It was in the $100,000 range. All I know for sure is what I billed them, and it was far less than $100K.