I recently had an epic failure on my workstation. It was bad. My machine simply stopped working, and when I rebooted, everything looked fine until it blue-screened during boot. So here I go with lessons learned during a critical failure.
RAID 1 is great
…so long as your controller can still function after a drive is detected as bad. Unfortunately, my motherboard, the ASUS P8Z68-V PRO, appears to lack this functionality. The Intel Z68 SATA 6.0 Gb/s controller reported that one of my two drives was bad and suggested I replace it immediately. Unfortunately, it wouldn’t let me boot in the meantime. It simply sat at the BIOS screen, waiting patiently for me to do something.
RAID configurations can disappear!
Digging into the problem a bit more, I poked around the BIOS looking for a reporting utility, or anything to confirm the drive was indeed bad. After another reboot, the RAID controller actually changed its mind. It informed me my RAID 1 member disks were no longer members. In fact, I had no RAID disks on my machine at all! Awesome news considering…
When organizing your backups, copy, don’t move!
Yeah, I had recently moved my backup data onto the RAID so I could de-dupe and organize the backups before returning the data to the backup drive, and ultimately off-site. Great news: that RAID/drive failure just cost me more than 7 years of scripts, databases, test data, documentation, etc. Time to find a great un-delete utility for my USB drive.
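Here is roughly what "copy, don't move" could look like in practice. This is only a minimal sketch, not what I actually ran: the helper names `sha256_of` and `copy_and_verify` are made up for illustration, and it assumes Python 3.9 or later. It copies a backup folder to a working location, then compares SHA-256 checksums file by file, and it never touches the source.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destination: Path) -> list[Path]:
    """Copy the source tree into destination without deleting anything.

    Returns the relative paths of files that are missing from the copy
    or whose checksum differs from the original. An empty list means
    every source file was copied intact.
    """
    # Hypothetical helper: copy only, the source is left untouched.
    shutil.copytree(source, destination, dirs_exist_ok=True)

    mismatched = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        relative = src_file.relative_to(source)
        dst_file = destination / relative
        if not dst_file.is_file() or sha256_of(src_file) != sha256_of(dst_file):
            mismatched.append(relative)
    return mismatched
```

You would call it with something like `copy_and_verify(Path("E:/backups"), Path("D:/backup-work"))` (both paths are placeholders) and do your de-duping on the copy. The original stays on the backup drive until the reorganized data has been written back and verified the same way.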
Read all the negative reviews on your hardware before purchase.
They’re not all bunk. Sometimes there is a nugget of truth in those negative reviews, even when you know some users are simply venting frustration. Also, you can sometimes find usable debugging information for your current failure.
Learn from your failures.
You can always learn from success. Learning from failures is a lot harder. Or at least it’s more painful. I’m going to rebuild this server and get back up and running. Hopefully I can resurrect my backup data. After that, I will get my offsite backups up and running. I’ll also learn from others’ failures, and use that to make my machine more stable going forward.
So for the next few days, I’m going to be recovering from this disaster instead of studying for my MCM knowledge exam. Until then, enjoy the holidays, and make sure you have a robust enough plan to handle an epic failure. I’m just sayin’.