Round Robin DNS is a leading technique for load balancing some service (typically an HTTP/web site), and it is often credited with providing high availability as well.
Large web sites need to be able to handle a huge number of requests - often more than a single web server can handle. The solution is to hand off incoming requests to a group of servers that each handle a portion of the load. The most convenient - and least expensive - way to do that is to use Round Robin DNS.
Round Robin DNS simply takes advantage of the way most DNS servers provide DNS records. Specifically, consecutive queries return the same results but in a different order. One website using this technique is cnn.com.
Using dig (dig cnn.com) I got the following results:
cnn.com. 300 IN A 18.104.22.168
cnn.com. 300 IN A 22.214.171.124
cnn.com. 300 IN A 126.96.36.199
cnn.com. 300 IN A 188.8.131.52
cnn.com. 300 IN A 184.108.40.206
cnn.com. 300 IN A 220.127.116.11
cnn.com. 300 IN A 18.104.22.168
cnn.com. 300 IN A 22.214.171.124
And a few seconds later I got this:
cnn.com. 266 IN A 126.96.36.199
cnn.com. 266 IN A 188.8.131.52
cnn.com. 266 IN A 184.108.40.206
cnn.com. 266 IN A 220.127.116.11
cnn.com. 266 IN A 18.104.22.168
cnn.com. 266 IN A 22.214.171.124
cnn.com. 266 IN A 126.96.36.199
cnn.com. 266 IN A 188.8.131.52
What's interesting to note is that the records are all the same, just starting from a different point. In fact, the cyclic sequence is unchanged as well; 184.108.40.206 always follows 220.127.116.11. Each response is simply a rotation of the previous one.
So when you browse cnn.com you might get your response from the server with the IP address 18.104.22.168 while the next browser might get a response from 22.214.171.124.
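The rotation, and the reason consecutive visitors land on different servers, can be sketched in a few lines of Python. The class name and the 10.0.0.x addresses here are hypothetical, purely for illustration - this is not how cnn.com's actual name servers are implemented:

```python
from collections import deque

# Hypothetical sketch: a zone that returns the same A records on every
# query, rotated by one each time - the Round Robin DNS behavior seen
# in the dig output above.
class RoundRobinZone:
    def __init__(self, addresses):
        self.addresses = deque(addresses)

    def query(self):
        answer = list(self.addresses)
        self.addresses.rotate(-1)  # next query starts one record later
        return answer

zone = RoundRobinZone(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
first = zone.query()   # ['10.0.0.1', '10.0.0.2', '10.0.0.3']
second = zone.query()  # ['10.0.0.2', '10.0.0.3', '10.0.0.1']

# Most clients simply connect to the first address in the answer, so
# consecutive resolvers (and hence browsers) hit different servers.
print(first[0], second[0])  # 10.0.0.1 10.0.0.2
```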
In this fashion the aggregate load placed on cnn.com is distributed over 8 servers. And if the load increases you simply add a server and another DNS record. So far this looks like an ideal solution - simple to set up and administer, and inexpensive too.
Let's consider the case where one of the servers stops working. Those unlucky folks who get the IP of the failed server just won't get a response. So you call your buddy Mario across town and gasp that the whole of cnn.com is down. Mario tries it and gets a response since his DNS inquiry got the IP of a server that was running. Unfortunately 12.5% (1/8) of the NEW inquiries will get the IP of the failed server. That's no good!
Well, I don't think that the sysadmins at cnn.com are asleep at the wheel - in fact they probably have a klaxon that sounds whenever a server takes a dive. And when a server fails the sysadmin does one of two things: she rushes over to the DNS server and removes the DNS record for the failed server (and synchronizes the other DNS servers), or she simply confirms that the program monitoring the failed server has already removed the corresponding DNS record. There is, after all, no reason that this task can't be automated. That way, only the new inquiries made while the failed machine's DNS record is still published - that 12.5% - actually go unanswered.
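Such a monitoring program might look something like the sketch below. Everything here is hypothetical - the function names, the health check, and the in-memory record list; a real deployment would update the authoritative name server's zone (and then let the change propagate):

```python
import urllib.request

# Hypothetical sketch of an automated monitor that drops the A record
# of any server failing its health check. `zone_records` stands in for
# the authoritative server's list of A records.
def prune_dead_servers(zone_records, is_alive):
    """Keep only the A records whose server passes the health check."""
    return [ip for ip in zone_records if is_alive(ip)]

def http_health_check(ip, timeout=2):
    # Assumed check: the server answers an HTTP request within `timeout`.
    try:
        urllib.request.urlopen("http://%s/" % ip, timeout=timeout)
        return True
    except OSError:
        return False

# Usage, with a fake health check so the example needs no network:
records = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
dead = {"10.0.0.2"}
records = prune_dead_servers(records, lambda ip: ip not in dead)
print(records)  # ['10.0.0.1', '10.0.0.3']
```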
Also, note that the TTL (Time To Live) of the DNS records is quite short at 300 seconds. This controls how long the record survives in each downstream DNS server. It's not reasonable - or even desirable - to expect every DNS server in the world to make a fresh inquiry each time someone wants to browse cnn.com. Neither is it desirable to set the TTL too long, since you wouldn't want a failed server to show up any longer than necessary (TTLs are typically set to 3600 or 7200 seconds or even longer). In short, when a server fails you want its DNS record to expire from every DNS server in the world as soon as possible.
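The caching behavior that the TTL governs can be sketched as follows. This is a toy model, not a real resolver - the clock is injectable so the expiry is easy to see:

```python
# Sketch of how a downstream resolver's cache honors the TTL: a record
# fetched at time t is served from cache until t + ttl, after which a
# fresh query is made upstream.
class DnsCache:
    def __init__(self, clock):
        self.clock = clock
        self.store = {}  # name -> (answer, expiry_time)

    def get(self, name, fetch, ttl):
        entry = self.store.get(name)
        if entry and self.clock() < entry[1]:
            return entry[0]            # still fresh: serve cached answer
        answer = fetch(name)           # expired or absent: re-query
        self.store[name] = (answer, self.clock() + ttl)
        return answer

now = [0]
cache = DnsCache(clock=lambda: now[0])
fetches = []
def fetch(name):
    fetches.append(name)
    return ["10.0.0.1"]

cache.get("example.com", fetch, ttl=300)   # cache miss: real query
now[0] = 200
cache.get("example.com", fetch, ttl=300)   # within 300 s: cached
now[0] = 400
cache.get("example.com", fetch, ttl=300)   # TTL expired: re-queried
print(len(fetches))  # 2
```

With a 300-second TTL a removed record disappears from well-behaved caches within five minutes; with 7200 seconds a failed server could keep receiving a share of the traffic for two hours.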
Of course, when a server is ready for use you only need to create a DNS record for it and wait for it to propagate, which won't take long since its peers (the other servers) have DNS records with short TTLs as well.
It's not necessary to setup and maintain your own DNS servers at all. AOL is providing this service for cnn.com (they're part of the same company) and UltraDNS provides similar services as well.
There are a few downsides to this arrangement. First, while round robin DNS load balancing provides an elegant and inexpensive solution for load balancing, it in no way provides high availability. Indeed, the very mechanism of its elegance is beyond the control of the sysadmin! Second, some Internet Service Providers (ISPs) routinely substitute longer TTLs where they feel - arbitrarily - that your published TTL is too short. Finally, many desktops perform their own DNS caching or are behind servers that provide DNS caching.
As an aside, some routers provide a feature that really works against Round Robin DNS, namely negative caching (RFC 2308 - Negative Caching of DNS Queries). This evil feature remembers names for which no answer was received. This means that a 10 minute server outage can be locally stretched to some arbitrary negative TTL, often 30 to 60 minutes.
Also, since Windows 98, MS Windows has been caching DNS locally as well, undoing the benefits of Round Robin DNS. Although this feature can be turned off, most Windows users are unaware that their machines are caching DNS at all.
In summary, Round Robin DNS Load Balancing is an effective, robust and inexpensive method of load balancing, with some warts and bumps. It should not be taken to be a solution for high availability computing.
Got something to add? Send me email.
More Articles by Dirk Hart © 2011-04-28 Dirk Hart