A longtime customer called on Thursday. "We had a power failure in part of the building yesterday and now some stuff doesn't work. Can you come down?"
"Just reboot the Windows machines", I said. My assumption was that a Windows machine or two had held on to an IP address that had since been handed out to another machine. This can happen if the router (it's their DHCP server) gets knocked out and some of the Windows machines do not.
"No, that's not it", he said. "It's the new server and the printer in the Parts department".
Hmmm. That is different. That printer definitely has a static address. I'm not so sure about the new server because I didn't put it in - it was supplied by somebody having them evaluate new POS software - but it certainly should have been static. I asked him to check the network setup in that server. "I already did", he said. "It was set to 'Obtain an IP address Automatically'. I changed it to the static address. It works internally, but the software guy can't get to it from his office. He says our router must be confused."
Yeah, like he knows anything about your routers, I thought to myself. Well, that I can check from here, so I did. No problem, and still set to forward the right ports to the right places. But he was right about one thing: outside traffic wasn't getting to that server.
"Are you sure you set it to .103?", I asked.
OK, I'll come look. What the heck, I haven't been out all week anyway. It's a nice ride.
When I arrived, I decided to look at the printer first. The software guy getting in remotely is of less importance than the people in Parts being able to print. I first printed out the config page and noted that the IP address ended in .122. That didn't sound right to me, so I checked the hosts file on the Unix box and yeah, it should be .170. Hmmm, what the heck is this? The printer was not set for DHCP, so somebody had put that address in manually.
I changed it back to .170 and tried to ping it. No response. I powered it off and back on and tried again. No response. OK, time for some forensic history. I tracked down the store manager and asked him what he knew about the printer. He knew a lot.
"It wasn't working, so I set it to 'Automatic' IP. That didn't work."
Well, yeah, of course not. The Windows machines may have been able to find it by NetBIOS, but the Unix box would be looking for .170, so that couldn't work.
"So then I tried setting it to the same IP as the server."
That's more common than you might think. Of course you cannot use duplicate IP addresses on the same network, but people sometimes think "Well, this magic number works over there, so maybe it will work HERE". I explained the folly of that. And then what?
"And then I remembered I have a spreadsheet with all the IP numbers in it, so I set it to that."
I had joked about not being able to make much money if customers were going to write things down, but I guess if the wrong thing gets written down it's almost worse, and it looks like that's what happened here. His spreadsheet was wrong. But the printer still wasn't working even with the right IP.
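A written list is only useful if it agrees with something authoritative. As a minimal sketch, assuming the spreadsheet has been exported to a simple name-to-address mapping, the check I did by eye against the hosts file could look like this. The host names and the 192.168.1 prefix are hypothetical; only the final octets (.170, .122, .103) come from the story.

```python
def parse_hosts(text):
    """Return {hostname: ip} from hosts-file style text, ignoring comments."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table[name] = ip
    return table


def find_mismatches(hosts, sheet):
    """Return {name: (hosts_ip, sheet_ip)} for names the two sources disagree on."""
    return {
        name: (hosts[name], sheet[name])
        for name in sorted(hosts.keys() & sheet.keys())
        if hosts[name] != sheet[name]
    }


# Hypothetical data: names and network prefix are invented for illustration.
HOSTS_TEXT = """
192.168.1.170  partsprinter   # Parts department printer
192.168.1.103  posserver
"""
SHEET = {"partsprinter": "192.168.1.122", "posserver": "192.168.1.103"}

mismatches = find_mismatches(parse_hosts(HOSTS_TEXT), SHEET)
for name, (hosts_ip, sheet_ip) in mismatches.items():
    print(f"{name}: hosts file says {hosts_ip}, spreadsheet says {sheet_ip}")
```

This only tells you the two sources disagree, not which one is right. Here the hosts file was the one the Unix box actually used, so it won.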
I went back for another look. This time I brought my cable testing tools from my car. A power failure isn't going to hurt a cable, but I didn't think it had blown the printer either as the failure hadn't been in that part of the building. Might as well check.
So I did. The cables were fine. I plugged them back into the switch and tried a ping again. It worked.
Hmmm. Loosely plugged into the switch maybe? I hadn't noticed that when I had unplugged to test. I took the cable out and plugged it back in. I tried the ping - it didn't work. I unplugged the cable again and moved it down the switch to another free port. The ping worked and the stacked-up print jobs started coming out.
I went back to the manager and explained what I had found (bad port on the switch). "Oh, I tried that too", he said.
Ahh. A troubleshooting mistake. Who knows what the original problem was, but it could have gone something like this: the port on the switch goes bad coincident with the power failure. He then changes the IP address on the printer. He then moves the cable to a different port. It would have worked then, but of course he had the wrong IP. He then moves it back. The problem is not controlling your variables: you have to change ONE thing at a time.
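To make the point concrete, here is a small sketch of that guessed-at sequence. It assumes the printer works only when both the switch port and the IP address are right; the step list is my reconstruction of what might have happened, not a record of what he actually did.

```python
def printer_works(port, ip):
    """The printer is reachable only if BOTH variables are correct."""
    return port == "good port" and ip == "right IP"


# Hypothetical reconstruction of the manager's steps, in order.
steps = [
    ("bad port", "right IP"),   # port fails around the time of the power failure
    ("bad port", "wrong IP"),   # IP changed to the spreadsheet value
    ("good port", "wrong IP"),  # cable moved; this would have fixed it, but the IP is now wrong
    ("bad port", "wrong IP"),   # cable moved back
]

results = [printer_works(port, ip) for port, ip in steps]
for (port, ip), ok in zip(steps, results):
    print(f"{port:9}  {ip:8}  -> {'works' if ok else 'fails'}")

# Changing one thing at a time from the starting state finds the fix in one move.
one_change = printer_works("good port", "right IP")
```

Every state he tried fails, even though he did try the right port. Because the IP had already been changed, the good port looked just as broken as the bad one.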
OK, printing is fixed but I warned them to keep an eye on that switch. It's fairly new, so shouldn't be failing. On the other hand, it's just something cheap he bought at Staples or wherever - a little 5 port, so no big deal to replace. On to the Terminal Server problem.
We knew that terminal services were working locally. I knew that the router was probably fine - if it were not, email and other inbound services would probably be dead also. But outside connections didn't work. I checked the server firewall settings first; they were fine. I looked in the logs and found no indication of problems. I scratched my head for a second and then I thought "Spreadsheet!" and opened up the network configuration. Sure enough, he had an incorrect default gateway set. I corrected that, and of course inbound services now worked.
Writing things down can be helpful, even if the things written are just "magic" to the person writing them. But when the "magic" is wrong, not understanding the actual reality leaves you with nowhere to go. If you understand why the gateway is necessary, you'd also know how to check it and how to get the proper gateway from any working system.
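The reason the server worked internally but not from outside is that machines on the same subnet are reached directly, while replies to anything off the subnet go through the default gateway. A rough sketch of the two checks I'd make, using Python's standard ipaddress module, follows. The addresses and the /24 mask are hypothetical, and `gateway_problems` is just an illustrative helper, not anything from the server itself.

```python
import ipaddress


def gateway_problems(host_cidr, gateway, known_good_gateway):
    """Return a list of reasons the configured gateway looks wrong (empty if none)."""
    iface = ipaddress.ip_interface(host_cidr)
    gw = ipaddress.ip_address(gateway)
    problems = []
    # A default gateway must be directly reachable, i.e. inside the host's subnet.
    if gw not in iface.network:
        problems.append(f"{gw} is outside {iface.network}")
    # It should also match what a machine that works is using.
    if gw != ipaddress.ip_address(known_good_gateway):
        problems.append(f"{gw} differs from working gateway {known_good_gateway}")
    return problems


# Hypothetical values: only the .103 host octet comes from the story.
print(gateway_problems("192.168.1.103/24", "192.168.1.254", "192.168.1.1"))
print(gateway_problems("192.168.1.103/24", "192.168.1.1", "192.168.1.1"))
```

The "known good" value is whatever a working machine on the same network reports, which is exactly where I would have sent him to look.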
I had him update his spreadsheet for the next time, though I don't know why this machine would have lost its address. I wonder if it really did. It's possible that it was just slow recovering from the power failure and that when he changed it, he broke it by having the wrong gateway. It doesn't really matter; we'll see what happens next time - or they could put better UPS devices in place!
© 2011-03-10 Anthony Lawrence