Yesterday I found myself in a situation no one ever wants to be in: I needed to restore a file from a customer's backups, but could not because no backups existed.
How did we get to this abominable condition? Through a series of errors and bad practices. Are you making any of these mistakes?
First, the box being backed up is a Linux box. That's unimportant - the mistakes that followed have nothing to do with Linux. I chose Microlite Backup Edge as the backup software so that we could do a complete bare metal restore if necessary. That wasn't a mistake.
I configured the software to do an FTP backup to a Windows server that I don't administer. Of course I needed the cooperation of the Windows administrator to set up the FTP directory for the backups. No mistakes so far.
I configured the software to notify only in the event of failure. That's not my usual habit: ordinarily I would notify by mail or printout on either success or failure. I do not remember why I set this for failure only; perhaps I was convinced that they did not need the "Success" notifications. I think that was a mistake, because the absence of an expected notification can signal a problem. If a notification of success had been expected, someone would have noticed the subsequent failures.
On the other hand, I have had customers completely ignore failures because all they were looking for was a piece of paper and were not actually reading it, so it is possible this would not have helped.
I configured the failure notification to print to a networked LaserJet printer. I did test the printer, but I did not test the failure report specifically. That's probably out of habit - I subconsciously expected that the "test" would come from the first backup's "Success" report, but that wasn't configured, so it was never tested. The failure notifications did print when they eventually occurred, but the printer was not configured to handle Unix line feeds (LF) correctly, so only one line printed and that of course was incomplete. Probably these looked like junk and were thrown away. Bad mistake.
Over the next 53 weeks, the system did 106 backups (two a week, which we had established as reasonable) and failed once. I know that from looking at the backup logs. Of course the single failure wasn't noticed, but that wasn't at all critical. No real issue here.
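The line-ending problem itself is simple to fix in software before the report ever reaches the printer. Here is a minimal Python sketch of the idea, assuming the report is available as raw bytes. The function name `lf_to_crlf` is my own invention for illustration; it is not part of Backup Edge or of any printer tooling.

```python
def lf_to_crlf(report: bytes) -> bytes:
    """Convert Unix LF line endings to CRLF for a printer that expects CRLF."""
    # Collapse any existing CRLF pairs to LF first, so that they are
    # not turned into CR CR LF by the second replace.
    normalized = report.replace(b"\r\n", b"\n")
    return normalized.replace(b"\n", b"\r\n")


if __name__ == "__main__":
    sample = b"Backup FAILED\nHost unreachable\n"
    print(lf_to_crlf(sample))
```

Whatever the fix, the real lesson stands: print an actual failure report once and read it.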
Sometime in March of 2007 the company renumbered this part of its LAN, going from a 192.168 network to 172.30. Of course they did notify me of that, but none of us remembered that the backups were going to the old 192.168 address. From that day onward, the backups failed, but as the reports were not being seen, nobody noticed. The renumbering of course was not a mistake, but not remembering that the backup was hardcoded to an old IP certainly was.
Again, if the printer had been producing a readable report, this might have been noticed. It wasn't. Week after week, backups failed. I may even have logged onto this system for minor changes during this period, but I never checked the backup logs. Another mistake. Technically, I was not responsible for checking, but I easily could have. I did not - until yesterday, when I needed a file. At that time I asked the Windows admin for an early tape because I had forgotten that we did this by FTP. He remembered that I was backing up over the network and reminded me. It was then that I looked at the logs and realized we had a problem.
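A check that does not depend on anyone reading a printout is to ask how old the last good backup is, and to complain when the answer is "too old" or "never". The sketch below is only an illustration: how you obtain the timestamp of the last successful backup depends on your backup software, and the five-day threshold is my assumption for a twice-weekly schedule, not something taken from this site's configuration.

```python
from datetime import datetime, timedelta
from typing import Optional


def backup_is_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=5),  # assumed threshold for two backups a week
) -> bool:
    """Return True if there is no recent successful backup."""
    if last_success is None:
        # No successful backup on record at all is the worst case.
        return True
    return now - last_success > max_age


if __name__ == "__main__":
    # Hypothetical dates for illustration only.
    last_good = datetime(2007, 3, 1)
    if backup_is_stale(last_good, datetime(2007, 9, 10)):
        print("WARNING: no successful backup within the expected window")
```

Run from cron and mailed to more than one person, a check like this turns silence into an alarm instead of a false comfort.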
However, we still had backups on the Windows server, right? These actually would have been fine for what I needed as the file I wanted hadn't changed in years. Unfortunately, sometime in August the Windows admin was looking for some extra space and came across the FTP directory that had not been used in three months. He conferred with a person at the company who has some responsibility in this area, but neither recognized that this was our Linux backup so they removed everything to gain space. Big mistake, but perhaps not too bad, because the Windows server itself also gets backed up to tape.
Unfortunately, the tapes are retained for only thirty days. That's yet another mistake: a good data retention policy keeps monthly and yearly media. Ideally the yearly media would go back a number of years, for forensic purposes if nothing else. So this was the killing mistake: we had no backups whatsoever.
So that's it: a series of small mistakes that led to a disaster. Well, not a complete disaster: I can manually fix whatever problem there is in this file, but it would have been quicker and easier to just restore it.
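To make "monthly and yearly media" concrete, here is a rough Python sketch of a tiered retention rule: keep everything recent, plus the first backup of each recent month, plus the first backup of each recent year. The function name and the default counts (30 days, 12 months, 7 years) are assumptions chosen for illustration; pick numbers that match your own legal and business needs.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, recent_days=30,
                    monthly_count=12, yearly_count=7):
    """Return the set of backup dates that a tiered retention policy keeps."""
    dates = sorted(set(backup_dates))
    keep = set()

    # Tier 1: every backup inside the recent window.
    cutoff = today - timedelta(days=recent_days)
    keep.update(d for d in dates if d >= cutoff)

    # Tier 2: the earliest backup of each of the latest months that have one.
    first_of_month = {}
    for d in dates:
        first_of_month.setdefault((d.year, d.month), d)
    for key in sorted(first_of_month)[-monthly_count:]:
        keep.add(first_of_month[key])

    # Tier 3: the earliest backup of each of the latest years that have one.
    first_of_year = {}
    for d in dates:
        first_of_year.setdefault(d.year, d)
    for key in sorted(first_of_year)[-yearly_count:]:
        keep.add(first_of_year[key])

    return keep
```

Anything not in the returned set is a candidate for reuse or deletion. With a rule like this in place, a thirty-day tape rotation would still have left monthly and yearly copies to fall back on.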
If you have not completely reviewed your backup policies and strategies recently, that is something that needs to be done, probably at least once a year. That's the final mistake: if the policies had been reviewed and backups checked yearly, this might have been less damaging.
More Articles by Anthony Lawrence © 2012-07-14 Anthony Lawrence