A few weeks back one of my larger Unix customers called about some small issue. We dealt with that, and he then off-handedly asked "More memory should speed things up, right?"
I hesitated and said "Maybe.. but not necessarily". I then asked why he wanted to know. He explained that at one of their sites they were experiencing some performance complaints. As it happens, this site has the largest user base, and the RAM in the box was only 2GB, so they had added another 2GB but the slowness persisted.
It was then my turn to explain a few things. First, slowness can come from many causes and you really need to identify WHY a system is slow before throwing fixes at it. Yes, it could need more RAM for user processes, but it could also be disk bound, cpu bound.. the RAM might need to be used for I/O buffers, for other tunables.. but we can't know any of that if we haven't identified the real problem.
So we arranged for me to log on to take a look. My first impression from a simple "w" was that performance looked fine. I looked at sar, sar -d and sar -r and still saw nothing. Sar was not turned on, so I had no historical data to compare with, but the system seemed fine to me. I enabled sar reports and arranged to check back later.
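Turning on sar collection usually means running the sadc collector periodically from cron. A sketch, assuming a Linux-style sysstat layout; the paths and script names differ on SCO and other Unixes, so check your own system:

```
# /etc/cron.d/sysstat -- paths are an assumption, verify them locally
# take a sample every 10 minutes
*/10 * * * * root /usr/lib/sa/sa1 1 1
# write the daily summary just before midnight
53 23 * * * root /usr/lib/sa/sa2 -A
```

With that in place, "sar -u" reports CPU, "sar -d" disk activity, and "sar -r" memory, and you have history to compare against instead of a single snapshot.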
As it happened, they called me again a few days later for some printer problems. A LaserJet wouldn't print; a power off, power on of the printer fixed that. A serial printer on a Digi Portserver wouldn't print.. it turned out that the PortServer had been decommissioned and nobody was supposed to use that printer.. not much I could do about that except have any jobs go to a different printer. But as long as I was logged in, I thought I'd take a look at sar.
This time there was definitely a problem. Sar showed 0% idle, with the CPU pegged at 84% user time.
I immediately ran a "ps" and spotted a "foxrun.pr" process that had been running about 36 hours and had accumulated almost that much in CPU time. Obviously that was the problem. Using "-o args" I was able to see the full command line and the customer identified it as a weekly process that was expected to take an hour or so. I killed it and sar immediately showed 94% idle stats.
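That hunt can be reduced to a one-liner. A sketch, assuming a Linux/SysV-style ps; option names vary on older Unixes, where you might run "ps -ef" and eyeball the TIME column instead:

```shell
# Show the five processes with the most accumulated CPU time.
# -e: all processes; -o: choose columns; --sort=-time: heaviest first.
ps -eo pid,etime,time,args --sort=-time | head -n 5
```

A process whose TIME (CPU consumed) is close to its ETIME (wall-clock elapsed) has been burning CPU almost continuously, which is exactly what that stuck foxrun.pr looked like after 36 hours.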
What probably had been happening is that this program would foul up sometimes, suck down performance, and after a day or so the local people would get sick of it and reboot ("stupid Unix machine needs to be rebooted every week!"). That would "fix" the problem until the next time the program screwed up.
While I was still on the phone my customer had a conversation with his off-shore Foxpro programmer about the program, explaining what I had observed. I could not understand the programmer very well, both because of his accent and the scratchy phone line, but I had the impression that he at least thought he could see where he might have gone wrong. My customer will watch that process a little more closely over the next few weeks.

It of course is possible to write scripts that look for run-away processes like this and kill them off. Foxpro seems to be particularly vulnerable to such errors; I've had to write scripts like that at other places where the programmers could not or would not fix the root problem.
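A watchdog like that might look something like this. It is only a sketch: the process name, CPU limit, and ps output format are assumptions you would adjust for your own system, and ps columns vary between Unixes:

```shell
#!/bin/sh
# Hypothetical watchdog: kill any "foxrun.pr" process that has burned
# more CPU time than we consider sane for the weekly job.

LIMIT=14400   # 4 hours of CPU time, in seconds (an arbitrary example)

# Convert ps TIME format ([[dd-]hh:]mm:ss) to plain seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4)      print $1*86400 + $2*3600 + $3*60 + $4
        else if (NF == 3) print $1*3600 + $2*60 + $3
        else              print $1*60 + $2
    }'
}

# Find matching processes and kill any over the limit.
ps -eo pid,time,comm | awk '$3 == "foxrun.pr" {print $1, $2}' |
while read pid cputime; do
    secs=$(to_seconds "$cputime")
    if [ "$secs" -gt "$LIMIT" ]; then
        echo "killing runaway process $pid ($cputime of CPU)"
        kill "$pid"
    fi
done
```

Run it from cron every few minutes. It is a band-aid, of course; the real fix is in the program, but when the programmers cannot or will not find the bug, this at least keeps the rest of the system usable.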
You can also sometimes use "ulimit -t" to prevent such processes from dragging down the rest of the system.
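The idea is that "ulimit -t" caps the CPU seconds a shell and its children may consume; when a process exceeds the limit the kernel sends it SIGXCPU. A quick demonstration with a deliberately tiny one-second limit (in practice you would wrap the real job in a subshell with a generous limit):

```shell
# A busy loop limited to 1 second of CPU time; the kernel kills it
# with SIGXCPU once the allotment is burned.
( ulimit -t 1; while :; do :; done ) 2>/dev/null
echo "loop was killed after exceeding its CPU limit (status $?)"
```

An exit status above 128 means death by signal. Wrapping the weekly job the same way, with a limit of a few hours, would have killed that stuck foxrun.pr automatically instead of letting it run for a day and a half.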
Got something to add? Send me email.
More Articles by Anthony Lawrence © 2009-11-07 Anthony Lawrence