Tao of Ops: Dealing with disappearing DNS

You're looking at your todo list and pondering what code to write during one of the brief moments of free time that appear on your daily schedule, when all of a sudden you get a message in team chat: "Is the site down for anyone else?"

It can be a frustrating experience, but never fear; you're not alone. We here at &yet experienced this type of outage once before, and then again this week. In fact, nearly every operations team has experienced at least a variation on the above nightmare. It is just a matter of time before you have to deal with people thinking your site or service is down when the problem is really with the Domain Name System (DNS). Even shops that spend a lot of money on DNS vendors with serious redundancy and scale will eventually fall prey to an orchestrated Distributed Denial of Service (DDoS) attack.

So what did we learn when we were faced with an attack this week? Mainly, a reminder of the importance of redundancy. The best solution is still the simplest: have more than one DNS vendor. Don't be fooled by the word "simple" - redundancy is the simplest idea, but it is not simple to implement. Let's walk through what we will need.

Pick two (or more) vendors. The crucial part is that every vendor has to have an API for changes to your DNS zone records. Without an API you will be forced to make every update by hand in each vendor's web interface, and the vendors' copies of the zone will drift apart sooner or later. Another criterion is that each vendor should have a solid track record of dealing with DDoS events - no use picking a vendor that falls over at the slightest attack.

Once you have two vendors selected, the next step is to gather the tools to coordinate zone record changes with each vendor. Make sure that any given tool can both create new entries and change existing ones (each API has its own quirks to deal with, so do your homework). This can be a command line tool that the vendor provides or that your configuration/change management tool supports, but most likely it will be a small set of scripts that you create to take the structured DNS data (JSON, XML, CSV, whatever is good for your team) and push the new/changed items to each vendor's API. A good example of one of these tools is what the Netflix team uses - denominator.
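To make the "small set of scripts" idea concrete, here is a minimal sketch of the planning half of such a script. It assumes your records live in JSON and that you can already fetch each vendor's current records through that vendor's API; the vendor names, the record layout, and the VENDOR_STATE snapshots are all hypothetical stand-ins, and the actual API calls are left out because they differ for every vendor.

```python
import json

# Hypothetical zone data in the team's structured format (JSON here).
ZONE_JSON = """[
  {"name": "www.example.com.", "type": "A", "value": "192.0.2.10"},
  {"name": "api.example.com.", "type": "A", "value": "192.0.2.20"}
]"""

# Hypothetical snapshots of what each vendor serves right now. In a real
# script these would come from each vendor's own API.
VENDOR_STATE = {
    "vendor_a": {("www.example.com.", "A"): "192.0.2.10"},
    "vendor_b": {("www.example.com.", "A"): "192.0.2.99"},
}


def load_zone(text):
    """Parse the JSON zone data into a dict keyed by (name, type)."""
    return {(r["name"], r["type"]): r["value"] for r in json.loads(text)}


def plan_changes(desired, current):
    """Split the desired records into ones to create and ones to update."""
    to_create = {k: v for k, v in desired.items() if k not in current}
    to_update = {
        k: v for k, v in desired.items() if k in current and current[k] != v
    }
    return to_create, to_update


def plan_all(zone_text, vendor_state):
    """Build a (create, update) plan for every vendor from one source of truth."""
    desired = load_zone(zone_text)
    return {
        vendor: plan_changes(desired, current)
        for vendor, current in vendor_state.items()
    }


if __name__ == "__main__":
    for vendor, (create, update) in plan_all(ZONE_JSON, VENDOR_STATE).items():
        print(vendor, "create:", sorted(create), "update:", sorted(update))
```

Keeping the plan separate from the push means the same zone file drives every vendor, and you can review what would change before any API call is made.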

The last step is to adjust the nameserver list for each domain (the NS records at your registrar and in the zone itself) to contain entries from both vendors. This allows fallback lookups to happen: if a DNS client's request cannot be resolved because one vendor's nameservers cannot be reached, the resolver can retry against the other vendor's.
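A quick sanity check for this step is to confirm that a domain's NS list really does include nameservers from every vendor. The sketch below assumes you already have the list of NS hostnames (for example from the output of `dig NS example.com +short`); the vendor names and hostname suffixes are hypothetical placeholders for your own.

```python
# Hypothetical mapping of vendor name -> suffix of that vendor's nameservers.
VENDOR_SUFFIXES = {
    "vendor_a": ".vendor-a.example",
    "vendor_b": ".vendor-b.example",
}


def vendors_covered(ns_hosts, vendor_suffixes):
    """Return the vendors that have at least one nameserver in ns_hosts."""
    covered = set()
    for host in ns_hosts:
        # Normalize: drop the trailing root dot and ignore case.
        normalized = host.rstrip(".").lower()
        for vendor, suffix in vendor_suffixes.items():
            if normalized.endswith(suffix):
                covered.add(vendor)
    return covered


def missing_vendors(ns_hosts, vendor_suffixes):
    """Return the vendors with no nameserver in the domain's NS list."""
    return set(vendor_suffixes) - vendors_covered(ns_hosts, vendor_suffixes)


if __name__ == "__main__":
    only_one_vendor = ["ns1.vendor-a.example.", "ns2.vendor-a.example."]
    print(missing_vendors(only_one_vendor, VENDOR_SUFFIXES))  # -> {'vendor_b'}
```

Running a check like this for every domain after each change catches the case where a domain was quietly left with only one vendor's nameservers.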

With the above in place you will be able to continue on the Operations Path without having to fend off the zombie DDoS horde - always a good thing for you and your team.
