Recently I came across a posting explaining "sort" to someone. The example given was sorting a list with multiple columns:
abc 123 jat 458 cse 247
and because they wanted to sort on the second column, the command was given as "sort +1".
It was the explanation that caught my eye:
Remember that the first word is 0, not 1 (this is due to really old legacy issues that also crept into the C language, btw)
I thought that was a pretty funny way to explain it. Using the word "legacy" makes it sound like it had to be done this way because something older would break if it were not, or perhaps that you used to have to do it this way but now that computers are more powerful it's a silly artifact. Not really true: it never HAD to be done that way, but programmers think that way for good reason.
When I started learning about computers, most of my peers in the field knew at least a little bit about programming. You really had to if you wanted to be able to do much of anything. That's not true today: many tech support people know nothing about programming, sometimes not even simple scripting. Even though I understand that they really don't absolutely need that knowledge to do their jobs, it still amazes me that so few have it, and it surprises me even more that so many are apt to have a very negative attitude toward the subject.
But given that, it's not surprising that these people don't know why programmers count from zero.
We're going to dive right into the gory guts of that sort command and see why it starts counting from zero. You don't need to know a thing about programming; you just need to think about how things can be stored in memory. That's what it's really all about: putting things into memory, manipulating them in some way, and letting you know the results. That is programming.
So, if we're going to put things in memory, we need to know where we put them if we are going to do something to those things later. Every single location in memory has an address. You get to a particular address by storing a number in a CPU register. So if you want to look at memory address 100, you'd stick 100 in one of the CPU registers, and voila, that's pointing at memory address 100 (or is pointing at nothing and is just storing 100 for you - but we're not going to get into how CPU registers work here).
CPU registers can hold numbers from 0 up to however big they are. That could mean numbers from 0 to 255 (an 8-bit register), 0 to 65535 (a 16-bit register), and on up. But notice it's always "0 to something". So if having 100 means you are pointing at memory location 100, what does having 0 in there mean?
Well, it could have meant "you made a big mistake, the first location is one". Actually, because of the kinds of mistakes programmers sometimes make, it might have been useful and interesting if CPUs had been designed that way. But they weren't, and actually I've simplified things here, so of course now we need to get a little more complicated.
When CPUs want to address memory, they usually do so by using two registers: a base and an offset. So you store 100 in one register and 2 in another, and that means you are pointing at memory address 102 (100 + 2). What's the point of this? Why use two when one would do? Well, some of that is due to legacy issues: early CPUs had hardware limitations that led to segment addressing, which will make you unhappy, so we won't get into it here. But the scheme also helps program design in general.
(That's not really how segmented addressing works but the rest will be easier for you to understand if we pretend it is.)
If 100 is in our base register, and 0 is in the offset, we are pointing exactly at address 100. Add 0 to 100 and you get 100. Programmers use this to build higher level structures called arrays.
Let's say you are going to store 500 bytes starting at address 3,000, and another 500 bytes starting at address 7,000. You want to compare those two collections of bytes, add 'em together, subtract one from the other, whatever. You are working with CPU registers and you want an easy way to access those bytes. Using the base plus offset scheme makes that much easier, especially if you can have more than one base register. Base register one stores 3,000, base register two stores 7,000, and to run through both areas in sequence we just need to increment our offset register.
In higher level languages (which always use these low-level registers underneath, of course), we might call the stuff at 3,000 "dogs" and the stuff at 7,000 "cats", and "cats(70)" would be whatever was at address 7,070. Naturally, "cats(0)" is the contents of 7,000 itself. The high level language could have arranged things so that "cats(1)" was 7,000 and "cats(0)" was meaningless or an error (and again, for various reasons that might have been useful). But high level languages are closely tied to what has to happen underneath, and doing that would mean a subtraction at the low level (or a complete redesign of CPU hardware!).
That would make the code that much slower. It may not seem like much, but subtractions add up, and manipulating arrays is a large part of what programs do. Nobody wants slower code, and there's also the fact that the extra subtraction introduces another place where the compiler (the thing that turns the high level code into actual CPU instructions) could be coded wrong and screw up.
So, this is why programmers count from zero. If sort were written today, it would probably not have that 0 offset syntax (in fact, the newer POSIX syntax is "sort -k 2", which counts fields from one, and the old "+1" form is considered obsolete). But when sort was written, nearly all of the people likely to use it understood base plus offset programming, and it would have seemed unnatural to them to say "sort +1" if they wanted the first field. Obviously "+1" is the second field if you are thinking of offsets, right? So there we were, and here we are today. It is a legacy, isn't it?
It is interesting to think about what would be different if CPUs treated base + 1 as equal to base, and raised a trap if an attempt was made to address base plus zero. If that were the design, programmers would count from 1, arrays would never start at 0, and certain types of programming errors couldn't be made. But that's not how CPUs work, so programmers do count from zero.
Got something to add? Send me email.
More Articles by Tony Lawrence © 2010-10-27