Riak, protocol buffer encoding, and you

Protocol buffer encoding is hard.

I really wanted to use them, though, seeing as there's a pretty significant speed increase when you don't have the overhead of HTTP.

Unfortunately, no one had written a node.js library for it. A couple of C bindings existed, but when I tried to use them, they either didn't even compile or I couldn't get them to work. That's when I had one of my all-too-common breakdowns, and decided to write my own. After all, anything for the sake of increased performance, right?

Using Google's specifications, I got started. In order to use protocol buffer encoding in any language, you have to start by writing a definition file to describe what messages exist, and what they contain. That definition is used for both encoding and decoding packets.

Obviously parsing these definitions is important, so that's where I started. The format of these files is reasonably consistent, so figuring out how to translate them into javascript objects wasn't too difficult. A couple of hours worth of work, and one big ugly loop later, I was feeling pretty good about myself.

Once I had the definition parsed, I could start on writing actual data. This took a bit more effort; the protocol defines several different data types, and each of them is stored in its own unique way. To add to the confusion, you can also specify both repeated and optional elements.

Since I was only writing this library to use with Riak, I decided to only use the data types that Riak uses in their definitions. Luckily for me, there are only two, varints and bytes. But what the heck is a varint? Back to the documentation I went. I learned that the varint is a fancy way to serialize numbers. According to Google's documentation:

"Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first."

Wow, that's not confusing at all. I had to take some time for my brain to wrap around this one. Eventually, it came to me. Split your number up into 7 bit chunks, reverse the order, then set the first bit of each group to 1, except for the very last one. If the number you serialize is less than 127 (that's 7 bits), this is very easy since the result is just the number itself.

Need to serialize 10? No problem, it's 10. That's the answer. 10. But what if you need to encode something much larger, say 500? This is where things get tricky. First things first, let's represent this in bits so it's easier to visualize:

111110100

Now we know we can only use 7 bits out of each byte since that first one is reserved, so let's break that into two almost bytes:

0000011 1110100

Now, you'll remember the documentation says "least significant group first." Well, that just means things are backwards, so let's flip those around:

1110100 0000011

Okay, but we also have to tell the varint that we're using multiple bytes, so that means we have to set the most significant bit on the first byte to a 1. Tack a 1 on to the beginning of the first byte, and a 0 on the beginning of the second, taking them from almost bytes to actual bytes, and we've got our varint representation of the number 500.

11110100 00000011

Wee, bits!

Deserializing is the reverse, the first byte that doesn't have the most significant bit set is the last byte of your number.

Drop the first bit of everything, and reverse all the bytes. Tada, there's your number.

If you want to see what the code that actually does this in protobuf.js looks like, check out my butils project at https://github.com/nlf/node-butils.

Anyway, now that I've probably lost 90% of everyone that was reading this with confusing math stuffs, I'll also explain how bytes or strings work. They're a much simpler critter – first we varint serialize the length of the string, then we append the bytes. That's it.

Now, remember how I said that protocol buffer applications have a file that describes the available messages?

Well, we have to store that information in the data stream too, so we know what data we actually have. How do you do that?

Easy peasy.

Each message is made up of several fields, each field has a number assigned to it as well as a type. An example from Google's documentation:



message Test1 { required int32 a = 1; required string b = 2; }

Here, there are two fields, an int32 (that's a varint) named "a" at number 1, and a string named "b" at number 2. To build the header for each piece of data, you store the type of the data in the last three bits, and the field number in the other 5. Field number 1 is a varint, which is numerically a type of 0. So the header for that field would look like:

00001000

Field 2, however, is a string which is numerically a type of 2, so its header would look like this:

00010010

Append the actual data to the header, and you've got a protocol buffer encoded message. Good for you!

For repeated fields, you simply use the same header on each item. Optional fields, if they aren't there just don't add them to the data stream. Now we've got all of the protocol buffer implementation that we need.

But what about Riak? I guess that'll be a separate library. Let's call this one protobuf.js, and move on shall we?

Riak prepends data messages with its own header. Their header is the 32 bit length of the internal Riak message code added to the length of the protocol buffer message, then the 8 bit message code itself, and the actual protocol buffer message. There's a small table of message codes in the Riak documentation, so this was really easy to add.

Now we know how to encode data into protocol buffers, we know how to decode that data, we know how to add the appropriate headers, all that's left is to slap an interface on it and we've got a client!

The client simply translates the function call to a Riak message code, so client.get() becomes RpbGetReq. We encode the parameters as a protocol buffer message, we prepend the Riak header, and send it off to the server. The reply comes back with the message code RpbGetResp, so we make sure we've got the right length of data (check your message length in the header!) then drop the Riak header, and decode the protocol buffer message.

That's all there is to it. The Riak client portion of the code is relatively simple. A little connection management magic, some packet queueing to make sure we don't send another request before we get a response for the last one, and it's done.

And that, my friends, is how a Riak protocol buffer library is born.

After I "finished", I went ahead and published what I had to npm, and basically forgot about it. Several months later, I got a pull request from Mathias Meyer (@roidrage on twitter and github) fixing a bug.

Then another.

After the third pull request, he told me that it's because he intended to integrate riakpbc as an alternate to the HTTP backend in riak-js.

We collaborated some more over the next several weeks, and as of version 0.10 riak-js supports Riak's protocol buffer interface through riakpbc.

Now that riakpbc and protobuf.js are being used in an already fairly widespread project, it's quite a bit more likely that bugs will be found and features will be requested.

I've seen a report from a user who was able to gain a 30% performance increase in his application, just by using protocol buffer encoding instead of HTTP, who wouldn't want that kind of performance for free?

You might also enjoy reading:

Blog Archives: