A few weeks back I received a call from a client who isn’t running one of our monitoring solutions. “Help! ServerX is not responding! I can’t even connect to it via RDP.” This machine happens to be a virtual machine, so I ask if he can log in to the host machine, and check for any errors on that guest.
“Yeah, drive D is showing -1 MB free.”
Wow, negative drive space! That’s pretty cool, I think to myself. We run through a couple of options to try to recover gracefully. Those fail. So, I drop back an extra 5 yards… I’m going for a Hail Mary.
“OK, I don’t like this option, but here’s how we’re going to fix it,” I begin.
I then walk him through forcing the guest machine to power off. We add drive space to D and reboot the machine. As soon as it boots, I RDP onto the box and look at the wreckage. Turns out the transaction log file had grown out of control. I dig a little deeper and find that the database I’d set to the full recovery model for them almost a year back was still using the full recovery model.
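Just one little problem… The maintenance plans that take the transaction log backup and the full backup were disabled. Under the full recovery model, the log is only cleared out by transaction log backups, so without them it just keeps growing.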
I walk the customer through a full backup. I then resize the log file to an appropriate size, and we re-enable both backup maintenance plans. Once they’re back up and running and the threat of any additional failure has passed, we start digging into the post-mortem. We want to find where the failure came from.
I dig through the agent log and find out when the last successful backups were run. I then work with the client to figure out what changed on that day. Turns out a newer technician, who had very little SQL experience, took over the maintenance of the server.
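If you’d rather not wait for a post-mortem to ask “when did the last log backup run?”, that question is easy to script. Here’s a minimal sketch in Python, not what I ran that day. It assumes you have already pulled each database’s recovery model and last log backup time from somewhere (for example, `sys.databases` and the backup history in `msdb.dbo.backupset`). The `DatabaseBackupStatus` class, the `find_stale_log_backups` function, and the 24-hour threshold are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class DatabaseBackupStatus:
    """One row per database. Hypothetical shape, not a SQL Server API."""
    name: str
    recovery_model: str                   # "FULL", "BULK_LOGGED" or "SIMPLE"
    last_log_backup: Optional[datetime]   # None if no log backup was ever taken


def find_stale_log_backups(
    rows: List[DatabaseBackupStatus],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed threshold; tune to taste
) -> List[str]:
    """Return names of databases whose transaction log is at risk of growing.

    SIMPLE recovery databases are skipped because they do not take log backups.
    Anything else is flagged when it has no log backup at all, or when the
    most recent one is older than max_age.
    """
    stale = []
    for row in rows:
        if row.recovery_model.upper() == "SIMPLE":
            continue
        if row.last_log_backup is None or now - row.last_log_backup > max_age:
            stale.append(row.name)
    return stale
```

In this client’s case, a check like this would have flagged the database the day after the maintenance plans were disabled, long before drive D filled up.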
Long story short, I ask for a private training session with the new technician. We work through the different recovery types available in SQL Server. We go over the pros and cons of each, as well as the costs of each. I then run through a few scenarios of how we can have failures, and how we can recover from each. I want the technician to be a little better equipped to handle the responsibilities he has, without calling him out in front of his bosses.
In the end, I pitch the idea of setting up some monitoring to help catch this sort of event before it becomes critical. Hopefully they choose to go with that monitoring.
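To give a feel for what even the simplest monitoring looks like, here’s a rough sketch of a free-space check in Python. It isn’t one of our monitoring products, just an illustration. The function names and the 10% threshold are my own assumptions; `shutil.disk_usage` is part of the Python standard library.

```python
import shutil


def is_low_on_space(free_bytes: int, total_bytes: int,
                    min_free_fraction: float = 0.10) -> bool:
    """Return True when the free fraction drops below the threshold.

    The 10% default is an assumed value for illustration only.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def check_drive(path: str, min_free_fraction: float = 0.10) -> bool:
    """Look up real disk usage for the drive holding path and apply the check."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return is_low_on_space(usage.free, usage.total, min_free_fraction)
```

Run on a schedule against the drive that holds the data and log files, with an email or a ticket whenever `check_drive` returns `True`, something this small would have raised a flag well before the drive hit zero.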
There are two takeaways from this story. First: know your recovery strategies in SQL Server, or contact us and we can help you determine which strategy is best for your company and your data. Second: be on the lookout for problems before they can do any real damage. Prevention really is the best medicine. We can help with that too!