The application vendor suggested that it was a hardware problem. However, after performing some diagnostics under the direction of HP, no problems were reported. Although the files seemed to be scrambled, there were no signs of file system corruption after running fsck. I am not quite convinced that it was a hardware problem. Is there any way to find out whether it is a program bug, an OS problem or a hardware issue?
There's no easy, absolute way to prove that the app is messing up the file, but you can make some efforts toward that end. Start by renaming the current data file and then copying it back to its original name. The renamed version still occupies the same disk blocks and inode that the old file did, so it should scramble if the disk hardware is messing up those areas of the disk. You check that by running sum on it regularly; the result shouldn't change.
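The "run sum on it regularly" step can be automated. What follows is only a minimal sketch: it uses SHA-256 from Python's hashlib in place of sum, and the names file_digest and watch, the ten-minute interval and the path in the usage note are all illustrative assumptions, not anything from the original setup.

```python
import hashlib
import time


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def watch(path, interval_seconds=600, checks=None):
    """Re-check the file every interval_seconds.

    Returns a timestamp string for the first check at which the digest
    differs from the starting digest, or None if `checks` re-checks pass
    without a change. With checks=None it loops until a change is seen.
    """
    baseline = file_digest(path)
    done = 0
    while checks is None or done < checks:
        time.sleep(interval_seconds)
        done += 1
        if file_digest(path) != baseline:
            return time.strftime("%Y-%m-%d %H:%M:%S")
    return None
```

Pointed at the renamed copy (for example watch("/data/app.dat.old"), a hypothetical path), a non-None result means a file that nobody should be writing to has changed, which points away from the app.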
To check whether a disk controller is scrambling things (I have had this), you need to write some test programs that create similar sized files and write to them in similar patterns to the app. I find the easiest way to do this is to write the same random data to 3 or 4 files, with the writes randomly spaced over a good chunk of time, and then run sum on the files, which of course should all sum exactly the same. If they don't, the controller starts to look very suspect.
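A rough sketch of such a test program might look like the following. The function names, the block size and the sequential write pattern are assumptions made for illustration; a real test should mimic how the app actually writes. SHA-256 again stands in for sum.

```python
import hashlib
import os
import random
import time


def write_identical_files(paths, total_bytes, block_size=8192, max_pause=0.0):
    """Write the same random blocks to every file, pausing randomly between blocks."""
    files = [open(p, "wb") for p in paths]
    try:
        written = 0
        while written < total_bytes:
            block = os.urandom(min(block_size, total_bytes - written))
            for f in files:
                f.write(block)
                f.flush()
                os.fsync(f.fileno())  # ask the OS to push the block out to the device
            written += len(block)
            time.sleep(random.uniform(0, max_pause))
    finally:
        for f in files:
            f.close()


def digests(paths):
    """Return {path: SHA-256 hex digest} for each file."""
    result = {}
    for p in paths:
        h = hashlib.sha256()
        with open(p, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        result[p] = h.hexdigest()
    return result
```

If len(set(digests(paths).values())) is anything other than 1, the files differ. Keep in mind that a read straight after the write may be served from the OS cache rather than the disk, so it is worth checking the digests again later.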
OTOH, memory could do this. *Very* unlikely, because it depends on the bad memory never being used by the OS (which would always cause a panic if used for code and surely would cause strange behavior otherwise) and always being used by the app *only* for data. But I guess anything is possible under some contrived circumstance, so to eliminate that, we write a little program that just fires up and allocates a nice block of memory similar in size to what the app uses (you can get that from ps -el). This test program writes different patterns into its space and reads them back (it's a memory tester, so you need different patterns to check for stuck bits, bleeding bits, etc.). You need to start it before starting the real app. It's probably a silly exercise, but if you have to PROVE something to someone..
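Here is a bare-bones sketch of that kind of pattern tester. The function name and the four patterns (all zeros, all ones and the two alternating-bit bytes) are illustrative choices. A user-space program like this can't choose which physical memory it gets and CPU caches sit in the way, so treat it as a demonstration of the idea rather than a real diagnostic.

```python
def check_memory_block(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each byte pattern, read it back, and return
    a list of (offset, expected, actual) tuples for any mismatches."""
    buf = bytearray(size_bytes)
    chunk_len = 65536
    mismatches = []
    for pattern in patterns:
        chunk = bytes([pattern]) * chunk_len
        # Write the pattern across the whole buffer, one chunk at a time.
        for start in range(0, size_bytes, chunk_len):
            end = min(start + chunk_len, size_bytes)
            buf[start:end] = chunk[:end - start]
        # Quick check first; scan byte by byte only if something is wrong.
        if buf.count(pattern) != size_bytes:
            for offset, value in enumerate(buf):
                if value != pattern:
                    mismatches.append((offset, pattern, value))
    return mismatches
```

An empty list means every pattern read back intact. Size the block from what ps -el reports for the real app, and run the check in a loop for as long as it takes to be convincing.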
After all that, you are going to get the argument that some other program on the system is writing into the file (can you tell I've been through this once or twice?). If you have lots of disk space you can turn on auditing and show that no other program ever touched the data file. If you don't have the space, you have to take running snapshots with fuser or lsof, though that may not satisfy a very stubborn vendor who is convinced that *their* programs never screw up.
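The running snapshots could be collected with something like the sketch below. It assumes lsof is installed and uses its -F field output, where lines beginning with p carry a process ID and lines beginning with c carry a command name. The function names are made up for the example.

```python
import subprocess
import time


def parse_lsof_fields(output):
    """Turn `lsof -F pc` output into a list of (pid, command) tuples."""
    holders = []
    pid = None
    for line in output.splitlines():
        if line.startswith("p"):
            pid = int(line[1:])
        elif line.startswith("c") and pid is not None:
            holders.append((pid, line[1:]))
            pid = None
        # Any other field lines (such as file descriptors) are ignored.
    return holders


def snapshot(path):
    """Return (timestamp, [(pid, command), ...]) for processes holding path open."""
    result = subprocess.run(
        ["lsof", "-F", "pc", path],
        capture_output=True,
        text=True,
    )
    return time.strftime("%Y-%m-%d %H:%M:%S"), parse_lsof_fields(result.stdout)
```

Calling snapshot on the data file every few seconds and logging the results gives you a record of who had it open, though a program that opens, writes and closes between two snapshots will slip through, which is exactly why this is weaker evidence than auditing.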
And when all is said and done, it's almost always "their" fault.
Not that I'm complaining about the income opportunities, of course :-)
More Articles by Tony Lawrence © 2013-07-27 Tony Lawrence