There's a REV drive on an old SCO Unix box in Boston that acts up now and then. A cold reboot fixes it for a while, but the problem always comes back eventually. It's been annoying, but never quite annoying enough to replace the unit.
Monday night at 9:00 PM the customer's Windows guy called me because he wanted to reset this machine and couldn't remember how to shut it down (that's probably some indication of the frequency of this "fix"). I gave him the command and when it came to the "Safe to Power off" screen I told him to do that.
He said it wouldn't shut off. I was a bit surprised that he was unaware of servers with power buttons that have to be held in for a bit to avoid accidental shutdown, but then again, this guy has crossed my path before and left me shaking my head in wonder.
I figured that was it, but he wanted me to hang in while he started the system back up. Fine. Go ahead.
"The switch must be broken."
We then had a few minutes of him mumbling things about the power switch, the reset button, and things being "stuck" before he finally came to the realization that he couldn't turn this machine back on.
"Can you bypass the switch?", I asked. No, apparently he couldn't. I did get him to pull off the front cover, but that was as far as he wanted to go. OK, so we'll get someone to fix it in the morning.
"Oh, no, they'll freak!", he exclaimed. Yeah, sure, I know: they are a 24/7 operation and nobody would be happy about the main server being down. I get it, but what other choice was there? Maybe there is 24x7 computer repair in New York or Las Vegas, but I don't think there is in Boston. They'd have to live with it.
"Oh, man, I have to call Joy!". I wouldn't say he had fear in his voice, but he sure wasn't happy. Let's just say that "Joy" doesn't match her well and leave it at that.
"She might call you", he warned.
Not if she has any brains, I thought to myself.
"Or the owner might call you."
Well, fine. I LIKE the owner. I doubted that she would call until morning anyway, but if she wanted to chat now, fine by me. None of it mattered: nothing was going to happen until morning. I explained that again and hung up.
Sure enough, nobody called until 9:00 the next morning. It was the owner, very concerned about the problem. I told her that it was surely simple: it was most likely either the switch as her consultant insisted or the power supply and all she needed was a competent hardware guy to come fix it. No "Unix" knowledge necessary, just someone who stocks some parts and has a screwdriver and a brain.
Yes, we could have given a screwdriver to the Windows guy and sent him to go buy parts. That would have still left us wanting the other requirement.
So, the owner made some phone calls and very soon a pleasant-sounding guy named Kevin was on his way in. He had more than one screwdriver and plenty enough brains to bring both a power supply and a spare box in case we needed to swap other things over.
When he got on site, he called me.
"Oh, yeah, it's the power supply - I can smell it. The one I have won't fit, but I think I can track down something that will."
OK, great. But I really did wonder about "I can smell it". If he could still smell it more than 12 hours later, why didn't last night's hero smell anything? Mysteries, mysteries.
Forty-five minutes later Kevin called again. Bad news. The power supply took the motherboard with it when it went. What now?
Let me just digress a minute. I have no idea what really happened here, but it does seem to be quite the confluence of circumstance, doesn't it? We shut down a functioning machine and upon powering on it blows. Blows hard, blows hard enough that it fries the m/b and the stink of it all is still around the next day and the Windows guy thinks it is the SWITCH?? Oh, well.
OK, so I asked Kevin if he could transfer everything to the box he had brought. Of course he could - he had screwdrivers and a working brain, right? So he did, and called me back shortly.
"Is it normal for it to hang initializing AMIRD?", he asked.
I hesitated. Yes, some delay is normal and even longer delays are possible but generally, no - this should have booted before Kevin got worried enough to call me about it. Could it be that whatever currents had surged through the motherboard had found their way to this controller also?
It didn't look to be. I had him reboot and go into the AMI BIOS with CTRL-M - everything looked good there. My suspicion was that this was a BIOS incompatibility with the ASUS m/b in the box. I called for a conference.
The owner was getting very nervous at this point. Here we have a dead machine and we don't know yet if the data on the drives is good. We don't know the state of backups either, except that apparently the REV had been failing for two weeks prior (no, I do not know why they ignored this for two weeks!). This wasn't good.
They did think they had other backups. Apparently our Windows guy had set up Mozy backups sometime back and had several Windows machines running that. I asked if the Unix data was mounted so that it would be included in the backup; he said it was - on several machines!
I need to digress again. It's hard enough to do an Internet backup of a large data set, but he apparently is trying to do this multiple times? I had serious doubts that even one backup could finish quickly enough to be worth anything - yes, they don't do a LOT of work at night, but I'd guess this backup could take half a day and who knows how long with multiple instances. I figured all Mozy would have is a mish-mosh of files spanning many hours of time. In other words, useless and pointless.
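To put rough numbers on that doubt, here is a minimal back-of-the-envelope sketch. I never measured their data set or their uplink, so the 20 GB and 4 Mbit/s figures below are made-up illustrative values, and the function names are hypothetical. It also assumes, simplistically, that several machines backing up the same data just share the one uplink, so the total work multiplies.

```python
def upload_hours(data_gb, uplink_mbps, efficiency=1.0):
    """Estimate hours needed to push data_gb gigabytes over an uplink.

    data_gb     -- size of the backup set in decimal gigabytes (assumed value)
    uplink_mbps -- upstream bandwidth in megabits per second (assumed value)
    efficiency  -- fraction of the link actually usable, between 0 and 1
    """
    if data_gb < 0 or uplink_mbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("need data_gb >= 0, uplink_mbps > 0, 0 < efficiency <= 1")
    bits = data_gb * 8 * 1000**3                    # gigabytes -> bits
    seconds = bits / (uplink_mbps * 1_000_000 * efficiency)
    return seconds / 3600


def shared_upload_hours(data_gb, uplink_mbps, machines, efficiency=1.0):
    """Same estimate when several machines each upload the same data set
    over one shared uplink (a simplifying assumption: no deduplication)."""
    if machines < 1:
        raise ValueError("machines must be at least 1")
    return upload_hours(data_gb * machines, uplink_mbps, efficiency)


if __name__ == "__main__":
    # Illustrative figures only, not measurements from this site.
    one = upload_hours(20, 4)
    three = shared_upload_hours(20, 4, machines=3)
    print(f"one machine:    {one:.1f} hours")
    print(f"three machines: {three:.1f} hours")
```

With those assumed figures a single full upload runs roughly half a day, and three machines doing the same thing over the same line take about three times as long. If the backup service deduplicates or only sends changes, the real numbers would be better than this, which is why it is only a sketch.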
Nonetheless, I recommended a multi-prong approach. First, we'd set someone trying to reach AMI to see what they'd recommend for firmware upgrades. Second, we'd start installing SCO on a plain vanilla IDE box and ask Mozy to ship whatever they have. Third, Kevin offered to go get another beater box with an Intel m/b of the same vintage (2004) as the dead box. Who knows? The REV drive might have a good backup - most of its failures were that it would lock up on the verify phase. The backup itself might be fine. If not, maybe Mozy had more data than I thought they did. Or, since the AMI RAID was just a mirror, maybe we could get at this with another SCSI controller. One way or another, we were going to get this running.
As it turned out, Kevin's beater Intel box was the answer. It booted right up and the only problem was redoing the on-board NIC. It was now 4:00 PM, but they were back running with no data loss. I told the owner she needed to hug Kevin.
I also suggested that she might want to talk to him about what other services he offers. I really think he might be a lot more competent than their current guy. I hope she takes my advice on that.
Going forward, a six-year-old box carrying your business lifeblood isn't a comfortable place to be. On the other hand, business is bad right now, so upgrading isn't a happy choice. Compounding this is the fact that their app vendor never supported Unix very well and is dropping all Unix support this year. It's time to move on - and quickly! I have been recommending that for more than a few years now - I think this scare will get them to actually do it. Yes, I lose a customer who I have known for almost 20 years, but that's OK. They need to get off this machine.
Got something to add? Send me email.
© 2011-03-10 Anthony Lawrence