Lightning hit something near a customer's building, came in through the cable wiring, and knocked out their internet connection. Their cable company was quite responsive, and had the whole thing fixed by the next morning, including a new cable modem.
Well, except that it wasn't quite fixed. They had a Linux box connected to the cable modem acting as a gateway and mailserver. Apparently the lightning wasn't quite done after it took out the modem, because the gateway wasn't routing traffic. That, of course, was when the customer called me. I happened to be fairly nearby, so I drove over and surveyed the problem.
The mail server itself seemed to be fine. It was just the network card on the WAN side that was dead. A quick, simple fix: replace the card and we'd be done. Except...
Except a lot of things. First, I didn't have the right card to replace it with. Actually, I had no card. I could have run down to Staples or somewhere and bought something, but sometimes you can't find a good card there, and if you can, it's going to be ridiculously priced. Second, the physical circumstances of this mail server are challenging. That is, it is installed in a closet - albeit a large closet, but a closet nonetheless - and it shares that closet with several other machines. The various servers came into the business at various times and they never bought a KVM switch, so even getting at the box to replace the card was going to involve shifting around a bunch of clunky stuff. Moreover, we had already decided to replace the mailserver with Kerio KMS6 and put in a router to replace the gateway functionality. The router (and a KVM switch) were already here, sitting on a shelf ready to be installed, in fact. Finally, although I had been nearby, it was close to the end of the day, and I was tired and in no mood to start getting into that whole upgrade project - and I couldn't anyway, because we didn't yet have the Suse Linux I was going to run Kerio on.
So I took the easy route. I programmed the router, rerouted the cable modem connection to that, and in minutes, they had working internet again. That stifled most of the complaints and questions ("Are we up yet?"), and I turned my attention to the mailserver. I had reconfigured it so it no longer wanted to be a gateway router, and had programmed the modem to forward packets inward to it, but mail wasn't flowing. All of my testing gave it a clean bill of health: locally, I could telnet to it on port 25 and give it mail which it would happily deliver, but no mail was coming from outside.
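The telnet-to-port-25 check is easy to script if you find yourself repeating it. What follows is only a minimal sketch in Python, not what I ran that day: the function names are my own invention for illustration, and it assumes the server sends its greeting as soon as you connect, as SMTP servers normally do.

```python
import socket


def smtp_banner(host, port=25, timeout=5.0):
    """Connect to host:port and return the first line the server sends.

    Returns None if the connection fails, times out, or the server
    closes the connection without sending anything.
    """
    try:
        # The timeout applies to the connect and to the recv that follows.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            data = conn.recv(1024)
    except OSError:
        return None
    lines = data.decode("ascii", errors="replace").splitlines()
    return lines[0] if lines else None


def looks_ready(banner):
    """An SMTP server that is ready for mail greets with reply code 220."""
    return banner is not None and banner.startswith("220")
```

Run against the server's LAN address, this only tells you what I already knew at that point: the server itself answers. Run from a machine outside, against the public name, it tests everything in between as well.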
It had to be the router. Had to be. I ssh'd out to another site and sent mail to this domain. It just hung in the queue. Why? I looked at the router again, but could not see any problem. No mistake there.
But there had to be a mistake somewhere, because no mail was coming in. I was tired and I wanted to go home, but they needed their mail fixed. Hmmm. They had a small block of IP addresses, and one had been set up for ssh to another box. Would that work? Yes, it did. I could ssh in from the outside world on that IP. So obviously router forwarding works... what the heck could be wrong?
I ssh'd out again and this time tried telnetting back in on port 25. This pinpointed the problem rather quickly: I couldn't resolve their domain. I couldn't because the registration had expired. My customer had thought they had renewed, but had actually just paid money for one of those "Internet Listing" scams.
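In hindsight, the order of the checks matters: see whether the name resolves before blaming the router. Here is a minimal sketch of that ordering in Python. The function name and the three result strings are made up for illustration, and the resolve and connect parameters exist only so the logic can be exercised without a real network.

```python
import socket


def diagnose(host, port=25, timeout=5.0,
             resolve=socket.getaddrinfo,
             connect=socket.create_connection):
    """Return "dns", "connect", or "ok" for a host and TCP port.

    "dns"     - the name could not be resolved at all
    "connect" - the name resolved, but the TCP connection failed
    "ok"      - the name resolved and the port accepted a connection
    """
    try:
        resolve(host, port)
    except socket.gaierror:
        # Name resolution failed; no point in looking at routers yet.
        return "dns"
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out: now suspect forwarding.
        return "connect"
    conn.close()
    return "ok"
```

A "dns" result from an outside machine would have pointed straight at the domain rather than at the router I kept staring at.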
Fortunately, this wasn't too hard to straighten out. Their registrar (DoubleDomains.com) actually answered the phone and we had them reinstated in minutes. We had to re-enter DNS info at Double Domains' name servers, but that was quick and easy. Of course mail wouldn't start working for a few hours at best, but at least I could go home. My job was done.
Well, no, there was that Kerio switchover. We scheduled that for the upcoming Monday. They could live a few more days running as they were, and the Suse would arrive before then.
Suse is now available for free download but they had already ordered the boxed set and preferred to leave it at that. I tried downloading it anyway but the servers just timed out. Too popular right now, perhaps.
So, I went home. I called the next morning to check, and yes, mail was flowing, and everything was happy. Good. See you Monday. I had plenty of other things to do for the rest of the week, and had a busy weekend too.
I took my time getting there Monday morning. This switch promised to be a little stressful, mostly because of the physical challenges, so I didn't want to fight traffic. Besides, they'd need an hour or so to deal with the morning's email before I shut them down for the switch. I'd be shutting everything down, at least momentarily, because I wanted that KVM in place. I was tired of trying to identify the right keyboard in that closet.
So, that was my first task. Shut everything down, pull out all the extra monitors, keyboards and mice, and hook up the KVM. When it was done, I actually had room to put down my coffee, which was a big improvement. I brought everything back up, and confirmed that everything was working. Yes, it was, and of course mail was still flowing.
Well, that had to be stopped, didn't it? We're about to install a new OS and a new mailserver. Can't have new mail coming in right now. So I shut off forwarding at the router. I left SMTP running for the moment, but told people that the mail server would be going down soon, so please do anything important now. I watched as the queue suddenly filled up with outgoing messages, and then I stopped listening on port 25 entirely.
Now I had to wait for the queue to empty. I didn't want to deal with reinjecting messages. I'd have to deal with imap stores and unread messages from pop users, but I didn't want any outgoing messages left over. Fortunately, it emptied out fairly quickly, leaving only three messages that couldn't be delivered right now. I tried forcing them a few times, but they wouldn't go, so I tarred them off along with the account list and any stored user messages.
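I did the tarring with plain tar at the shell, but the same step can be sketched in Python for anyone who wants it inside a migration script. This is an illustration only: the function name is invented, and the paths you pass in depend entirely on where your own mail server keeps its queue, account list, and mailboxes.

```python
import tarfile
from pathlib import Path


def archive_before_reinstall(paths, dest):
    """Write every path that exists into a gzip-compressed tarball.

    Directories are added recursively. Returns the list of paths that
    were actually archived, so anything missing is easy to spot.
    """
    archived = []
    with tarfile.open(dest, "w:gz") as tar:
        for path in map(Path, paths):
            if path.exists():
                # tarfile strips the leading "/" from absolute paths,
                # so the archive unpacks relative to wherever you are.
                tar.add(str(path))
                archived.append(str(path))
    return archived
```

Compare the returned list against what you asked for before wiping the disk; a path that was silently skipped is a lot cheaper to notice at that point than after the reinstall.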
The Suse install went smoothly, as did the Kerio, and I wrote little scripts to transfer the accounts and mail messages. No problems at all, everything worked beautifully. I turned the router forwarding back on, and watched the queue for incoming mail. This had been pretty smooth so far.
Yeah, right. Nothing was coming in the queue. Oh, come on! Mail was working this morning, why not now? I double checked everything locally. SMTP running, check. Router forwarding, check. Nothing in the error logs. What the heck is wrong now?
Amazingly enough, it was name resolution again. Their domain couldn't be resolved; in fact, the assigned name servers couldn't be contacted at all. If I did a "dig" at Double Domains' servers, it just hung: no response, and it would eventually time out.
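Pointing dig at one particular name server is the quickest way to see this, but the idea is simple enough to sketch with nothing beyond Python's standard library. The code below is a rough stand-in, not a dig replacement: both function names are my own, it hand-builds a bare A-record query, sends it over UDP to the one server you name, and reports only the response code or the absence of any reply.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 is an A record)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Name terminator, then query type and class IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def query_nameserver(server, name, port=53, timeout=3.0):
    """Send one query directly to one name server.

    Returns the DNS response code (0 = no error, 3 = no such domain),
    or None if no usable reply arrives before the timeout.
    """
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as unreachable or unresolvable servers.
            return None
    if len(reply) < 12 or reply[:2] != packet[:2]:
        return None  # too short to be a DNS header, or not our query ID
    flags = struct.unpack("!H", reply[2:4])[0]
    return flags & 0x000F  # the low four bits hold the response code
```

A None from every name server assigned to the domain is the scripted equivalent of what I saw: dig hanging until it gave up.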
I got a tech on the phone. Apparently at the same time my customer temporarily lost their domain, Double Domains was planning a hardware switchover themselves. What seemed to have happened is that we had entered data on the "old" nameservers, but new ones were put up at just about the same time, and we got caught in the gap - our info hadn't been transferred to the new servers. The old servers had been left up, which is why we had worked for several days, but they had coincidentally been taken down at the same time as our switch. The tech was able to transfer the data, but of course once again the customer would have to wait for this to propagate. Oh well, at least I was really done now.
DNS had caught me twice at the same job. Both times were very coincidental and circumstantial, which at least momentarily caused me to check everything else first. I'm not sure the customer fully understood what had happened and why, but later on that afternoon I confirmed that email was indeed working again.
© 2012-07-14 Tony Lawrence