This is based on a real incident, though the facts have been simplified a bit to make it easier to follow.
"Type cd space slash e t c", I yelled.
Somewhere in Ohio, a slightly frazzled service tech questioned what he'd heard: "cd slash e c t?"
When doing Unix support with a Windows user, I always try to be very patient and very friendly. Having a lousy phone connection doesn't help this process: it's hard to sound friendly when you have to shout. But at least we were making progress. When the call started, the machine had crashed and was not coming up. Actually, that turned out to be not quite true, but it looked that way to the people on-site, and, given the circumstances, that was a reasonable conclusion.
From their point of view, here's what happened: all serial printers suddenly stopped working. At the same time, a "PANIC: clfree - Free block " message appeared on the console screen. Being unable to do anything else, they powered off, and when the system came back up, it of course needed to run fsck. They did that, but the system just "went dead" after the fsck - no logins.
I was a little confused at first, because they told me this was a SCO 5.0.5 system, but you shouldn't see that clfree panic after 3.2v4.2. It's possible to have a similar problem on 5.0.5, but the message would be "PANIC: HTFS: Free block freed on HTFS" normally. Either way, I had a good idea what part of the problem was. The solution would be simple: get to single user mode, run "fsck -b -s -y /dev/root" or "fsck -ofull", possibly update some patches and we'd be done. But it wasn't going to be that easy.
In fact, the scratchy voice at the other end told me that he'd been trying to get to single user mode, but without any luck. No matter what he did, the system would either panic again, or just "go dead" on him, necessitating a power cycle reboot. Hmm. That didn't sound good. Maybe missing some important files, like inittab? But no, as I had him read me what he saw on the screen, it was apparent TCP/IP was starting up. He had no logins on the console, and had no Digiboard-connected terminals anyway, but I asked him to try to telnet in from his laptop, and to his surprise, he got in.
I knew what was wrong now: one of the rc scripts hadn't finished. Inittab is set to "wait" for the rc scripts to finish; if one does not finish, gettys never start on the console or on the Digiboards. If you have TCP/IP, that will have started before these scripts, so you can telnet in.
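To make the "wait" idea a little more concrete, here is a minimal sketch that picks out the inittab entries whose action field is "wait". The sample file is invented for illustration (the ids and paths are placeholders, not copied from a SCO system); the only thing it relies on is the usual id:run-levels:action:process field layout.

```shell
#!/bin/sh
# Sketch: list inittab entries whose action field is "wait".
# The sample file is hypothetical; ids and paths are placeholders.
demo=$(mktemp -d)
cat > "$demo/inittab" <<'EOF'
ex1:2:wait:/path/to/startup-scripts
ex2:2:respawn:/path/to/login-process console
ex3:2:once:/path/to/one-shot
EOF

# Fields are id:run-levels:action:process, so the action is field 3.
wait_entries=$(awk -F: '$3 == "wait" { print $1 " -> " $4 }' "$demo/inittab")
echo "$wait_entries"
rm -rf "$demo"
```

On a real system you would point the awk line at /etc/inittab instead. Anything listed there has to finish before init moves on to the entries that follow it.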
But which script? If this had been 5.0.5, I would have looked at /etc/rc2, as described in "OpenServer 5.0.5, system hangs just before the login prompt when booting to multiuser mode". However, a "uname -X" told me that this was 3.2v4.2, as I had suspected. So, I had the tech do this:
cd /etc/rc2.d
ls -lut
(See Troubleshooting for the why behind that).
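The short version of the why: "-u" makes ls use access times, and a script gets read (accessed) when it is run, so scripts that ran during this boot carry fresh access times while scripts that were never reached keep old ones. Here is a small sketch of that idea, assuming access times are being recorded. The directory, file names, and dates are made up for the demo, and "touch -a" stands in for the scripts actually being run.

```shell
#!/bin/sh
# Sketch: use access times to see which startup script was read last.
# All names and timestamps here are invented for the demo.
demo=$(mktemp -d)
for s in S01first S50middle S88USERDEF S99never; do
    printf '#!/bin/sh\nexit 0\n' > "$demo/$s"
done

# Pretend every script was last read several days ago.
touch -a -t 201207280900 "$demo"/S*

# Simulate this boot: scripts are read in order, and it stalls in S88USERDEF.
touch -a -t 201208020801 "$demo/S01first"
touch -a -t 201208020802 "$demo/S50middle"
touch -a -t 201208020803 "$demo/S88USERDEF"

# -t sorts newest first, -u selects access time.
# Without -l there is no "total" line to skip.
last_run=$(ls -tu "$demo" | head -1)
never_run=$(ls -tu "$demo" | tail -1)
echo "last script read: $last_run"
echo "never reached:    $never_run"
rm -rf "$demo"
```

The newest access time marks the script that was running when things stalled, and any script still showing an old date was never reached.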
I asked him if all the dates he could see were the same. He said most were, but the last was dated several days ago. That told me that this script had never been reached, because some other script was hung. I then had him do:
ls -lut | head -1
That told me that the LAST script executed was S88USERDEF. Taking an educated guess, I had him immediately do:
cd /etc/rc.d/8
ls -lut
and asked again if the dates were all the same. He said (as I expected) that there were only three files, "pcu", "digscr", and "userdef", and that "userdef" had an old date on it. I asked if "pcu" was the first file listed, and he said it was. I asked him to look at it with "more", and he said it was "gibberish". That shouldn't be: that is a text script that is part of the Digi initialization. I asked him to edit it and put an "exit 0" as the first line, and then to type "reboot".
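The "exit 0" trick works because the shell runs a script one command at a time, so it exits successfully before it ever gets to whatever follows. A minimal sketch, using a stand-in file created for the demo (not the real Digi script) whose body would otherwise loop forever:

```shell
#!/bin/sh
# Sketch: neutralize a hanging script by making "exit 0" its first line.
# The file is a stand-in created for this demo, not the real Digi script.
demo=$(mktemp -d)
script="$demo/pcu"
printf 'while :; do sleep 1; done\n' > "$script"   # would hang if run as-is

# Keep a copy so the original can be restored or examined later.
cp "$script" "$script.orig"
{ echo 'exit 0'; cat "$script.orig"; } > "$script"

# The shell exits on the first line and never reaches the loop.
sh "$script"
status=$?
first_line=$(head -1 "$script")
echo "first line: $first_line, exit status: $status"
rm -rf "$demo"
```

The copy is my own addition for the sketch; on the phone we simply edited the file in place. Either way, the script now does nothing, so whatever it was supposed to initialize stays uninitialized.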
This time, the system happily came to a "Control-D" prompt. I had him put in the root password, and run "fsck -b -s -y /dev/root". That had to clear a lot of files, but I could tell from the modes that these were temporary files and named pipes, so I wasn't too concerned. After fsck finished, we went multi-user, and everything appeared to be working, except that none of the Digiboard printers worked. That didn't particularly surprise me, as we had short-circuited an initialization file and may have had a defective board anyway. I asked the tech if he knew about "mpi" to run Digi's diags, and he was already familiar with that, so at this point I left him, suggesting that he should at least try downloading new Digi software, but that a better idea would be to put the printers on a print server and eliminate all need for serial ports. He agreed that was a good idea.
I do not know what caused the problem with the Digiboard file. I did suggest that this anomaly and the file system corruption might be an indication that the hard drives or memory could be failing, and cautioned that he should be religious about backups and consider an upgrade to new hardware as soon as possible. He assured me that was already planned.
The combination of the clfree panic and the /etc/rc.d/8 hanging made this a more difficult problem than it otherwise would have been.
© 2012-08-02 Tony Lawrence