Yesterday I went out to try to recover a SCO system running Medical Manager. What I found was an old system with a DPT controller set up as a RAID 5 with three drives, but one dead.. and it had crashed, and crashed hard.
According to the customer, the drive had failed some time ago.. and at some point they had tried to replace it, but it had never come back online. The system was sitting at a single user root prompt, and she explained that if they continued to normal startup, they would not be able to log in. The first thing I looked at was /etc/passwd - it was gone..
I don't mean "gone" as in the symbolic link being missing (SCO uses a lot of symlinks). Nor do I mean that the real password file down in /var/opt/K was missing.. it was listed in the proper directory there.. but the inode that directory entry pointed at was trashed.. we had disk corruption.
And it was not limited to that. Looking in /dev and other system directories, "ls -l" would show strange characters in the permissions field.. obviously there had been a hit to the inode table. I turned fsck loose and it quickly filled up /lost+found..
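For anyone wanting to hunt for that kind of damage without eyeballing every directory, here is a minimal sketch (not what I ran that day). The function name find_bad_modes is made up for illustration, and the set of file-type letters it accepts is an assumption: older SCO releases may show a few extra type letters for special files, so expect to adjust the pattern. It reads "ls -l" output on standard input and prints any line whose permissions field is not a plausible mode string.

```shell
#!/bin/sh
# Sketch: flag `ls -l` lines whose permissions field looks corrupt.
# find_bad_modes is a hypothetical helper; the accepted type letters
# (-, d, l, c, b, p, s) are an assumption and may need extending.
find_bad_modes() {
    awk '
        # Skip blank lines and the "total N" header line.
        NF < 2 || $1 == "total" { next }
        # A sane mode is a type letter plus three rwx triplets,
        # optionally followed by an ACL/attribute marker.
        $1 !~ /^[-dlcbps][-r][-w][-xsS][-r][-w][-xsS][-r][-w][-xtT][.+@]?$/ {
            print "suspect: " $0
        }'
}
```

Something like "ls -l /dev | find_bad_modes" would then list only the entries worth a closer look. It only catches damage that shows up in the mode field; a node with sane permissions but wrong contents passes straight through.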
Well, that's OK. They use Microlite Edge for backup, so I figured I'd be restoring data anyway. I could not get "edgemenu" to run, but "edge" at the command line worked fine, so after fsck finished, I started restoring files. I ran into two immediate problems: some files that should have been directories were not, and we were just about out of disk space after the restore. The first problem was easily fixed by removing the corrupt ex-directories, but the second turned out to be more of a problem: I started removing files in /lost+found and about ten minutes in, the system panicked.
Ooops.. so the lost+found files themselves were also corrupt.. that's really bad.. but the system did reboot and we were back at "CTRL-D for normal system startup". I asked for the root password (remember, it had been at a shell prompt when I arrived) and the person I was working with said she didn't know..
But she must know.. she had obviously typed it in before.. well, apparently she had been on the phone with Sage Software (Medical Manager) when she had done that, but couldn't remember what they had her do. No problem, we'll just get Sage back on the phone..
Nope, not available right now.. we left a message and I sat down to wait. But.. maybe I could go multi-user, because I had restored /etc/passwd, so I tried it and was able to log in as "ccmenu".. on this system, that's a superuser account and it gave me a menu that included "Unix Utilities", and that menu included "Unix Shell" as an option.. unfortunately that wanted a password too.. but "Read Mail" did not, so I did "!/bin/sh" within mail and got to a "#" prompt. I changed the password, did an "init 1" and ran "fsck -ofull -y /dev/root".
That found a fair number of problems, but most were in /lost+found and the ones that were not were on text report files.. so not so bad. When it finished, I ran it again and all was clean.. we went multi-user and I had them run reports to prove out the data. Those all passed, so things were looking good.
While they were checking more reports, I talked with the owners about the foolishness of continuing with this system. I explained that SCO was in dire straits, might not last much longer, and that their old (3.2v5.0.5) OS wouldn't be able to be installed on modern hardware.. I strongly suggested that they see what Sage could offer for an upgrade.
It turns out that Sage now offers Medical Manager on RedHat Linux. That's great news, because the owners did NOT want to move to a Windows system.. they've seen too many problems to fall for that. So they have asked Sage for an upgrade quote and will move on that very quickly.
About then we ran into a problem: someone had tried to log in on a second screen and got a message saying that they couldn't use /dev/ttyp35.
I took a quick look and saw the problem. The /dev/ttyp entries on this system should be major number 58 and the minor number should match the pty number. So ttyp0 is 58,0 and ttyp35 should be 58,35 - but it was 54,19 instead.. again, inode corruption.
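That numbering rule is regular enough to check mechanically. What follows is a minimal sketch, not something I ran on the customer's machine: the function name check_pty_nodes is made up, it assumes "ls -l" prints the major and minor as separate fields 5 and 6 (as in "58, 35"), and it assumes an awk that understands -v. The major number 58 is specific to this system; pass a different value if yours differs.

```shell
#!/bin/sh
# Sketch: report /dev/ttypN nodes whose major,minor do not match the rule
# "major is 58, minor equals the pty number".
# check_pty_nodes is a hypothetical helper; it reads `ls -l` lines on stdin
# and assumes the device numbers appear as fields 5 ("58,") and 6 ("35").
check_pty_nodes() {
    expected_major=${1:-58}
    awk -v want="$expected_major" '
        {
            name = $NF
            sub(/.*\//, "", name)              # drop any leading directory
            if (name !~ /^ttyp[0-9]+$/) next   # only look at ttypN entries
            n = name
            sub(/^ttyp/, "", n)                # pty number taken from the name
            if (substr($1, 1, 1) != "c") {
                printf "%s is not a character device\n", name
                next
            }
            major = $5
            sub(/,$/, "", major)               # strip the trailing comma
            minor = $6
            if (major + 0 != want + 0 || minor + 0 != n + 0)
                printf "%s is %d,%d, expected %d,%d\n", name, major, minor, want, n
        }'
}
```

Run as "ls -l /dev/ttyp* | check_pty_nodes 58", it prints one line per bad node and nothing at all if every entry is sane. It only reports; it deliberately changes nothing.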
That's easy to fix, though: "rm /dev/ttyp35; mknod /dev/ttyp35 c 58 35" will do it. However, a few minutes later it happened to a different pty, so the corruption is ongoing: this is a real hardware issue, not the result of a crash. That puts a certain urgency into changing machines. I explained that SCO 5.0.5 would not work with any modern hardware and that while we certainly could still find systems that it would work with, that would take a few days at least.. and seemed to me to be throwing good money after bad. If they could just limp along with this system, being aware of the potential danger of corrupting customer data, I felt they'd be better off just switching to Linux and abandoning this as quickly as possible.
We were able to speak to Sage soon after that, and they said that although a new machine would take a few weeks, they felt they had some used boxes that they could supply in the meantime - a little extra expense, but probably less than wasting money on SCO.. so that's the plan for the moment.
I saw no point in trying to get the third drive working: the controller could be bad, and it could get much worse if I tried to change anything.. in this case, I think it's best to just let sleeping dogs lie.
I checked with the customer this morning; no more corruption in the ttyp's, and all data is still proving out.. so maybe they can survive this without much downtime. I sure hope so.
It's good that Sage can move them to Linux. It makes no sense to throw more money into the old box, and it certainly makes no sense to stay on SCO.
More Articles by Anthony Lawrence © 2009-11-07 Anthony Lawrence