When you write data, it doesn't necessarily get written to disk right then. The kernel maintains caches of many things, and a lot of work goes into caching disk data to keep everything fast and efficient. That's great for performance, but sometimes you want to know that data really has gotten to the disk drive. This could be because you want to test the performance of the drive, or because you suspect a drive is malfunctioning: if you just write and read back, you'll be reading from cache, not from the actual disk platters.
So how can you be sure you are reading data from the disk? The answer actually gets a little complicated, particularly if you are testing for integrity, so bear with me.
Obviously the first thing you need to do is get the data in the cache sent on its way to the disk. That's "sync", which tells the kernel that you want the data written. But that doesn't mean that a subsequent read comes from disk: if the requested data is still in cache, that's where it will be fetched from. It also doesn't necessarily mean that the kernel has actually sent the data along to the disk controller. Traditionally, and as far as POSIX is concerned, a "sync" is a request, not a command that says "stop everything else you are doing and write your whole buffer cache to disk right now!": the writes are scheduled, and the call may return before they are done. (Linux is stricter: its sync waits for the writes to complete before returning.)
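
As a minimal sketch of that first step, assuming GNU dd and a scratch file made with mktemp (the file is purely illustrative), the following writes some data, asks dd to flush that one file before exiting, and then issues a system-wide sync:

```shell
# Scratch file for the demonstration (hypothetical; any writable path works).
testfile=$(mktemp)

# Write 1 MiB of zeros. conv=fsync (a GNU dd option) makes dd call fsync()
# on the output file before it exits, so this one file is flushed.
dd if=/dev/zero of="$testfile" bs=1M count=1 conv=fsync 2>/dev/null

# Ask the kernel to write out everything else it is still holding.
sync

# Report how many bytes ended up in the file.
wc -c < "$testfile"
```

Neither step changes where a later read is served from: the data just written is still sitting in the cache.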
Traditionally, the only way to be sure you were not reading back from the cache was to overwrite the cache with other data. That required two things: knowing how big the cache is at this moment, and having unrelated data of sufficient size to overwrite it with. On older Unixes with fixed-size buffer caches, the first part was easy enough, and since memory was often expensive and in shorter supply than it is now, the cache wasn't apt to be all that large anyway. That's changed radically: modern systems allocate cache memory dynamically, and while the total cache is still small compared to disk drives, it can now be gigabytes of data that you need to overwrite.
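
On Linux, one place to see how much memory the cache holds right now is /proc/meminfo. A small sketch, assuming a Linux /proc filesystem (the function name is made up for this example):

```shell
# cache_kb: print the combined size, in kB, of the "Buffers:" and "Cached:"
# lines of /proc/meminfo (Linux only).
cache_kb() {
    awk '/^(Buffers|Cached):/ { sum += $2 } END { print sum }' /proc/meminfo
}

cache_kb
```

The figure moves around as the system runs, so treat it as a rough lower bound on how much unrelated data you would have to read.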
Well, that's not always so hard: for a large filesystem and relatively small memory, a simple "ls -lR" might be enough. If not, a "dd" redirected to /dev/null can fill it up. Just make sure that you are looking at different disk blocks than what you first wrote. Note that you really didn't even need the "sync" if this is what you are doing: the overwrite forces the sync itself.
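
The two approaches just mentioned could be wrapped up like this. It is only a sketch: the function names are invented here, and the paths you feed them should be files or directories unrelated to the data under test.

```shell
# walk_tree DIR: list every file under DIR recursively, pulling directory
# and inode data through the cache. Output and errors are discarded.
walk_tree() {
    ls -lR "$1" > /dev/null 2>&1
}

# read_through FILE: read FILE sequentially and throw the data away,
# filling the page cache with its blocks along the way.
read_through() {
    dd if="$1" of=/dev/null bs=1M 2>/dev/null
}

# Example (paths are placeholders):
#   walk_tree /usr
#   read_through /path/to/large/unrelated/file
```

How much you have to read depends on how much cache the machine has at the time, so a single small file will not be enough on a system with gigabytes of free memory.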
Modern Linux kernels make this a bit easier: in /proc/sys/vm/ you'll find "drop_caches". You simply echo a number to that to free caches.
To free pagecache, echo 1 to it.
To free dentries and inodes, echo 2.
To free pagecache, dentries and inodes, echo 3.
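
Spelled out as commands, with a small wrapper around them. The wrapper is a sketch, not part of any standard tool: its name and its checks are assumptions made for this example. It runs "sync" first because drop_caches does not free dirty objects.

```shell
# The three documented values, run as root:
#   echo 1 > /proc/sys/vm/drop_caches    # free pagecache
#   echo 2 > /proc/sys/vm/drop_caches    # free dentries and inodes
#   echo 3 > /proc/sys/vm/drop_caches    # free pagecache, dentries and inodes

# drop_caches LEVEL: sync, then write LEVEL (1, 2 or 3) to drop_caches.
# Returns 2 on a bad LEVEL and 1 if the file is not writable (not root).
drop_caches() {
    case "${1:-}" in
        1|2|3) ;;
        *) echo "usage: drop_caches 1|2|3" >&2; return 2 ;;
    esac
    [ -w /proc/sys/vm/drop_caches ] || { echo "need root" >&2; return 1; }
    sync
    echo "$1" > /proc/sys/vm/drop_caches
}

# Example, as root:
#   drop_caches 3
```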
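You need to call "sync" before doing that. The kernel documentation describes drop_caches as non-destructive: it will not free dirty objects, so data that has not yet been written stays in the cache and can still satisfy a read. I haven't looked at how this is implemented; my first assumption was that the pending writes would be done before the cache is actually thrown away, and that in the meantime the cache would be seen as invalid so subsequent reads would have to wait for the write before returning. It would be simple enough to test this.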
Actually, maybe not. I tried testing this on a SUSE instance in a virtual machine, and couldn't do it. The script I used looked like this:
What I expected was for /tmp/t not to have the latest date. However, it always did, probably because ReiserFS would fix up partial transactions. You'd need a system without a journaled file system to test this.
But even that didn't seem to work: I created an ext2 fs on another virtual hard drive and tried this:
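But that didn't behave as I thought it would either. Possibly VM caching is throwing this off? Nope: I tried the same thing on a real system; the file doesn't lose its updates. That is what you'd expect if drop_caches never discards dirty data, which is how the kernel documentation describes it. So I wouldn't trust drop_caches by itself: without a "sync" first, unwritten data simply stays in the cache.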
However, if you are testing for integrity, and perhaps even if you are doing serious performance testing, this isn't enough: disk drives almost always do their own caching. If we really need to be certain that our reads came directly from the platters and not from RAM on the controller, we still need to go back to the idea of knowing how big that cache is and writing enough data to force it to be flushed. So we are still going to be running "dd" or "ls -lR" or something like that.
If you are examining integrity and suspect corruption, keep in mind that aging can affect your results: you might need data to sit in cache (kernel or disk hardware) for some period before the problem occurs. Quick overwrites might mask it. Tracking down this kind of problem can be very difficult.
By the way, if your aim is simply to bypass cache buffering, you can do that: Raw Disk I/O is what you want. And (as some databases do) you could simply write data to a raw partition (no filesystem).
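
On Linux, one commonly used way to bypass the kernel's cache for a single transfer is O_DIRECT, which GNU dd exposes as iflag=direct and oflag=direct. The sketch below assumes GNU dd; the function name is invented, and because not every filesystem accepts O_DIRECT it falls back to an ordinary buffered read and says so:

```shell
# read_uncached FILE: read FILE with O_DIRECT if possible, discarding the
# data. Prints "direct" if the direct read worked, or "buffered" if it had
# to fall back to a normal cached read.
read_uncached() {
    if dd if="$1" of=/dev/null bs=1M iflag=direct 2>/dev/null; then
        echo "direct"
    else
        # O_DIRECT was rejected (some filesystems do not support it).
        dd if="$1" of=/dev/null bs=1M 2>/dev/null && echo "buffered"
    fi
}

# Example (path is a placeholder):
#   read_uncached /path/to/file
```

This only skips the kernel's cache. Any cache on the drive or controller is still in the path.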
© 2012-08-01 Anthony Lawrence