Aging is an interesting phenomenon. We're all aware of what aging brings about: aches and pains, fading memory, the discovery that you have a favorite easy chair, a new-found interest in Viagra. Yet we don't see the effects of the aging process on a day-to-day basis. Like evolution, it takes a long time for age-related deterioration to become apparent. And when it does, the discovery can be an unpleasant experience -- or at least an instructive one.
Computer hardware ages too, and deteriorates from constant use. Case in point: I was contacted the other day by a colleague whose client had recently experienced a mysterious rebooting of their OpenServer host. This is a machine that our company had built and installed early in 1998. By today's standards, this server is ready for an easy chair. Yet, in more than 5-1/2 years of continuous operation, the client had experienced absolutely no unannounced downtime with this unit, which says a lot about the general stability of the UNIX environment -- and, of course, the quality of our products :-).
My first thought was that this was just another situation where the power had failed long enough to cause the uninterruptible power supply (UPS) to trigger an automatic shutdown due to a low battery. The server lives in an office in Chicago, long famous for deep-dish pizza, losing baseball teams and unreliable electrical service. Power blips, sags, surges, spikes and outright failures are the norm during the summer, especially given all the air conditioning machinery in operation.
However, the client was absolutely certain that no protracted power outage had occurred. This piqued my curiosity, given the total lack of problems experienced with this machine. Logging into the server, I rummaged around for a while but could not find any kind of obvious problem. syslog had nothing in it that caught my attention, other than an occasional complaint about no tape being loaded at backup time (as an aside, syslog had grown to be quite huge, which suggested to me that the client was not reviewing the log at regular intervals). syslog did show that a normal shutdown had occurred around the time in question, but, of course, didn't enumerate a cause.
Not finding any kind of smoking gun to say otherwise, I was reasonably certain that UNIX itself was not the culprit. On this machine, a kernel panic would not have caused an unannounced restart, as the PANICBOOT option in /etc/default/boot was set to NO. Plus, whatever led up to the panic might have logged something into syslog -- no such entry existed. While the notion of a power-induced shutdown was still dancing in my head, I was also wondering if maybe I was seeing the early signs of an imminent hardware failure.
A problem experienced in some modern servers is microprocessor overheating. Today's CPU speeds are much higher than the screaming 200 MHz of this relatively old machine (although 200 MHz did seem awfully fast when this unit first went into service), and thermal factors are much more important than they were in the past. I have repaired relatively new servers that had unexpectedly gone down because the CPU had overheated, often the result of a dirt-clogged CPU cooler.
However, that was not a likely scenario with this particular machine. Its processor simply doesn't run all that hot, its CPU cooler has relatively coarse fins that don't tend to collect dirt, and there is no automatic thermal shutdown feature in the motherboard hardware -- the CPU would simply go up in smoke if it did overheat.
I also considered possible memory or other hardware problems, for example, a failing host adapter. None of these were likely, as they would have resulted in the kernel voicing complaints to various logs.
Given all this, it was hard for me to ignore my original conclusion: an extended AC power failure had occurred and the UPS had shut down the server. Yet the client insisted that wasn't the case. This made no sense to me at all, as this particular UPS -- an industrial-strength ferroresonant unit -- was capable of sustaining operation on battery for close to an hour, assuming a fully charged battery. Nevertheless, I combed through the UPS log and, sure enough, a short-duration power loss had occurred around the time the mysterious reboot took place.
Unlike the inexpensive line-interactive and double-conversion UPSes that have flooded the marketplace (a classic example of caveat emptor, if there ever was one), a ferroresonant UPS isolates its load from the power line with a special transformer whose design makes it act like the electrical equivalent of a flywheel. When a momentary reduction or loss of power line voltage occurs, the massive ferroresonant transformer replaces the missing AC cycles from energy stored within the transformer's iron core. Aside from totally shielding the load from power line cruft, this flywheel action gives the battery/inverter segment of the UPS more than adequate time to come on-line and sustain the output, resulting in no measurable break in the output. Naturally, the transformer can only replace a small number of AC cycles, which means that unless power comes back relatively quickly, the battery must be ready to immediately shoulder the load.
That is what should have happened in this case. However, the UPS log showed that the UPS shutdown came within a few seconds of the power interruption. The cause of the mysterious restart quickly became apparent: the UPS's battery was simply unable to produce enough output. As it turned out, this was the original battery and thus was nearly six years old. Father Time had managed to do what Chicago power could not do in the past: trigger an unannounced server shutdown.
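The tell-tale in the UPS log was the gap between losing AC and the shutdown: a few seconds, on a unit that should run close to an hour on battery. That comparison can be sketched in a few lines of Python. This is only an illustration: the function names, the one-hour rated runtime default, and the 50% threshold are my assumptions, and you would have to pull the two timestamps out of whatever log format your own UPS software writes.

```python
from datetime import datetime

def holdover_seconds(power_lost, shutdown_started):
    """Seconds the UPS carried the load on battery before shutting down."""
    return (shutdown_started - power_lost).total_seconds()

def battery_suspect(power_lost, shutdown_started,
                    rated_runtime_s=3600, min_fraction=0.5):
    """Flag the battery if holdover fell well short of the rated runtime.

    rated_runtime_s and min_fraction are assumed values for illustration;
    substitute the figures for your own UPS and load.
    """
    holdover = holdover_seconds(power_lost, shutdown_started)
    return holdover < rated_runtime_s * min_fraction

if __name__ == "__main__":
    # Made-up timestamps: shutdown began five seconds after AC was lost.
    lost = datetime(2003, 9, 1, 15, 0, 0)
    down = datetime(2003, 9, 1, 15, 0, 5)
    print(holdover_seconds(lost, down), battery_suspect(lost, down))
```

A shutdown that arrives after most of the rated runtime is the UPS doing its job; one that arrives within seconds points at the battery.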
The moral of this story: load test your UPS at least every six months to determine if it can sustain operation following power failure. Change out your UPS batteries at least once every four years, or better yet, every three years. Naturally, if you are not comfortable with working inside a UPS, seek someone who is. Better he should get the daylights shocked out of him than you!
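If you look after more than a handful of machines, the schedule above is easy to turn into a reminder. Here is a minimal Python sketch, assuming you keep the battery install date and last load-test date on file for each UPS. The function names and the example dates are hypothetical, "six months" is approximated as 182 days, and I have used the stricter three-year replacement interval.

```python
from datetime import date

LOAD_TEST_INTERVAL_DAYS = 182   # roughly six months
BATTERY_MAX_AGE_YEARS = 3       # the stricter of the two replacement intervals

def add_years(d, years):
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)

def ups_maintenance_due(battery_installed, last_load_test, today):
    """Return the list of overdue maintenance tasks for one UPS."""
    tasks = []
    if (today - last_load_test).days > LOAD_TEST_INTERVAL_DAYS:
        tasks.append("load test")
    if today >= add_years(battery_installed, BATTERY_MAX_AGE_YEARS):
        tasks.append("replace battery")
    return tasks

if __name__ == "__main__":
    # Illustrative dates only: a battery installed in early 1998 and never
    # load tested since, checked in the autumn of 2003.
    print(ups_maintenance_due(date(1998, 2, 1), date(1998, 2, 1), date(2003, 9, 1)))
```

Run from cron once a week with real dates, something like this would have nagged about the battery in this story long before Chicago's power grid got the chance to expose it.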
While on the subject of aging, another gotcha involves hard drives. The server in question was built with a Seagate Barracuda SCSI hard drive, long considered one of the best and most trustworthy designs ever conceived. Seagate's specifications for the Barracuda line at the time stated that these drives had a five year service life. Service life is determined by a combination of past experience, wear measurements taken on laboratory samples and failed units returned under warranty, mathematical extrapolations, educated guessing, and possibly by tarot card reading. In other words, if the manufacturer says the drive will survive five years of service, consider that to be an optimistic estimate, not a take-it-to-the-bank fact.
Generally speaking, expected service life will decrease as the drive is subjected to more start/stop cycles. Start/stop cycles cause wear to the heads and media, an unavoidable consequence of media contact as the heads land in the parking zone. However, in the case of most servers, start/stop cycles tend to be infrequent. For example, this client's server has averaged only two to three cycles per year. Therefore, drive aging has been primarily the result of spindle bearing and head actuator mechanism wear.
Modern SCSI drives are able to automatically spare out bad blocks that develop as the unit ages (resulting in what are referred to as "grown" defects). As a result, most of the effects of wear tend to not be noticed at all, something that is not true of IDE drives. In a continuously running drive, spindle bearing and head actuator mechanism wear is quite gradual. Like a slowly weakening UPS battery, the effects of such wear tend to go undetected until either the drive becomes noticeably noisy or an abrupt failure occurs. As a rule, such failures result in immediate and total loss of all data (your backups are current, right?).
The only sure way to avoid a catastrophic, wear-induced drive failure is to replace the drive before the published end-of-service life has been achieved. If you have any clients running on older servers, be sure to get this information in front of them so they can plan for the inevitable. After all, putting new hardware into service is always more fun than dealing with the aches and pains of age.
© 2011-03-10 Steggy