Choosing the right database for the job

October 01, 2014Nathan Fritz

The NoSQL "movement" in database design has been motivated by many factors (such as simplicity and scalability) and has resulted in many more choices among storage and retrieval solutions. Instead of a one-size-fits-all approach, you can choose a database that is optimized for your specific needs.

So what are your needs, and how do you choose the right database for the job?

If you don't need a database cluster, and can live with a single node and snapshot backups, then you can pretty much do whatever you want.

The CAP Theorem

If we assume you need a cluster for high availability, let's talk about the CAP Theorem. Wait, I just lost some of you to eye rolls and "this doesn't apply to me," but please, hear me out.

The CAP Theorem states that a distributed database can only have two of the following qualities:

consistency (writes are in order, and you'll get the same answer from multiple servers at any point in that order)
availability (if a database node is running, you can always write to it)
partition-tolerance (even if database nodes can't communicate with each other, you can still read the data)

You've heard "fast, good, cheap: pick two" — right? Essentially, for any given data type, you get to pick two.

In practice, there are only two realistic choices for any given database: consistence and partition-tolerance ("CP") or availability and partition-tolerance ("AP"). "CA" doesn't really exist, because without partition tolerance, you don't have availability. At least Kelly Sommers and Aphyr agree.

Some databases assert that the CAP theorem doesn't apply to them. To see what happens when databases make claims that violate the CAP theorem, have a look at Aphyr's Call Me Maybe series. Not all of these databases are bad though, sometimes they're simply claiming a tuned CP like FoundationDB or a way of resolving consistency for you (which you still need to be cautious and aware of).

What You Get With CP Databases

Most SQL databases claim CP. This means that they guarantee consistency. You can still read during bad partitions, and you'll probably write to single place for any given table. But they're not available. If the master is down, and another hasn't been elected, you can't make writes.

Most SQL databases scale very high for reads, because that load can easily be distributed, but they don't scale horizontally for writes while keeping consistency; a master-master SQL cluster must resolve write conflicts with timing or app intervention. SQL servers can become unavailable for writes during high load of the master servers and outages that only effect master nodes. If your data is primarily about reading (a product database, a user database, etc.), then SQL is an excellent choice, and will scale very well for you.

But let's say your use case is collaboration. With an SQL database, you'll soon have to start sharding your users and interactions to separate clusters to keep up with writes. For collaboration use cases, your users are writing just as much as reading.

Considering AP Databases

It may be time to look at an AP database.

Some of the best AP database clusters are based on the Dynamo whitepaper by Amazon.

A database cluster that claims availability means that a large selection of nodes will need to be offline before you lose the ability to write and read a given piece of data. Generally databases with the availability claim work best when any given piece of data exists in multiple data-centers. An AP database also has partition-tolerance: the nodes can stop talking to each other, and your data is still there to read and write.

As the CAP theorem implies, the big downside to AP database clusters is the lack of consistency. This means that for any given piece of data, you might get different results depending on which node you ask. Generally these databases gather results, so that each node will eventually all agree on multiple versions of the truth.

This isn't as bad as it sounds.

Lack of consistency happens in an AP cluster when a single piece of data has been written to in two different places, either in close proximity to time, or during a net-split (when the two nodes were unable to communicate due to network failure). Some databases have settings and even defaults for resolving these cases with the Last-Write-Wins solution. Keep in mind, however, that you're destroying data from a confirmed write.

The best solution for lack of consistency is in your application itself, perhaps in the API service, for example. When you read a piece of data and get multiple results back, you should have a function for resolving consistency problems with that data-type. Merging the data, picking one over another, or creating a new data object are all possible solutions, and should be hand selected for each data type.

Hybrid Solutions

In the end, you may want to mix a single state server, a CP cluster, and an AP cluster together for the different data types that you use. This hybrid approach can help you strike the right balance when your requirements are complicated by the limits of the CAP Theorem.

Since this is my blog post, I'll tell you about my personal favorites:

Standalone, non-clustered data, snapshot backups, slaving: Redis
Consistent, partition-tolerant, high reads, low writes: Postgres
Available, partition-tolerant, horizontal scaling: Riak

I'm sure I'll have different favorites tomorrow. :-)

I've also been working on Dulcimer, which &yet uses to develop locally against an embedded leveldb, and deploy and scale against Riak (Postgres support is in progress). Dulcimer aims to be a single way of dealing with a wide variety of key-store technologies.

In the end, choose the best tool for the job, and don't get wedded to any specific piece of technology.

If you have in-depth questions about choosing the right database for your needs, consider our architecture consulting services.

Feel free to comment directly to Nathan Fritz @fritzy.

&yet

The Blog