Wednesday, January 16, 2013

Riak: What's faster? 2i or Key Filters

Recently we experienced a large surge in the size of our data, and it has called into question some of our querying approaches and our node configuration.

We had been using 2i for most of our querying, and a combination of "Data Point Objects" and MapReduce for our more analytical needs. However, when our MapReduce jobs started bombing, we reviewed our querying approach. (For the record, they were bombing due to "pref_list exhausted" errors, which are currently resolved. More on this in a subsequent post.)

It didn't take long to find posts like this on Google:
Be aware that key filters are just a thin layer on top of full-bucket key listings. You'd be better off storing the field you want to filter in a secondary index, which more efficiently supports range queries (note that only the LevelDB storage engine currently supports secondary indexes). Barring that, you could use the special "$key" index for a range query on the key. - Sean Cribbs
Key Filters are the backbone of our querying approach. We were (and still are) under the impression that this is the best way to get a subset of the data into MapReduce. Since most of our queries are over date ranges, all of our keys are prefixed with yyyymmddhhmmss. This allows us to use date-based Key Filters in our MapReduce jobs. According to the post, however, we should be using 2i and the special $key field for performance reasons.
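To make the date-prefixed key scheme concrete, here is a minimal sketch of what a "between" Key Filter job could look like when sent to Riak's MapReduce endpoint as JSON. It uses only Ruby's standard library. The bucket name "events", the helper names, and the single map phase are assumptions for illustration, not our production code.

```ruby
require "json"

# Hypothetical helper: format a Time as the yyyymmddhhmmss key prefix.
def key_prefix(time)
  time.strftime("%Y%m%d%H%M%S")
end

# Build a MapReduce job whose inputs are narrowed by a "between" key filter.
# Keys sort lexicographically, so a timestamp prefix gives date-range filtering.
def key_filter_job(bucket, from, to)
  {
    "inputs" => {
      "bucket"      => bucket,
      "key_filters" => [["between", key_prefix(from), key_prefix(to)]]
    },
    "query" => [
      # Riak.mapValuesJson is one of Riak's built-in JavaScript map functions.
      { "map" => { "language" => "javascript", "name" => "Riak.mapValuesJson", "keep" => true } }
    ]
  }
end

# "events" is a placeholder bucket name.
job = key_filter_job("events", Time.utc(2012, 11, 1), Time.utc(2013, 1, 1))
puts JSON.pretty_generate(job)
```

The resulting document is what would be POSTed to the cluster's /mapred resource. Riak still has to list every key in the bucket and test each one against the filter.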

First of all, what is $key??? Here is what the Riak Handbook has to say about the $key field:
There are special field names at your disposal too, namely the field $key, which automatically indexes the key of the Riak object. Saves you the trouble of specifying it twice. Riak automatically indexes the key as a binary field for your convenience ...
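As a rough illustration of how the $key index can be used, the sketch below builds the two usual forms of a 2i range query: a MapReduce job with an index input, and the HTTP path for a plain index lookup. The bucket name, key values, and helper names are assumptions for illustration. The keys are assumed to be URL-safe, so no escaping is done.

```ruby
require "json"

# Build a MapReduce job fed by a 2i range query on the special $key index.
def key_index_job(bucket, start_key, end_key)
  {
    "inputs" => {
      "bucket" => bucket,
      "index"  => "$key",   # special index on the object's own key
      "start"  => start_key,
      "end"    => end_key
    },
    "query" => [
      # Riak.mapValuesJson is one of Riak's built-in JavaScript map functions.
      { "map" => { "language" => "javascript", "name" => "Riak.mapValuesJson", "keep" => true } }
    ]
  }
end

# Path for a plain 2i range lookup over HTTP, which returns only matching keys.
def key_index_path(bucket, start_key, end_key)
  "/buckets/#{bucket}/index/$key/#{start_key}/#{end_key}"
end

# "events" and the two timestamps are placeholder values.
puts JSON.pretty_generate(key_index_job("events", "20121101000000", "20130101000000"))
puts key_index_path("events", "20121101000000", "20130101000000")
```

The first form is POSTed to /mapred like any other job. The second is a GET against the index resource and involves no MapReduce at all.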
So I put it to the test. For the record, we are using Ruby and Rails, with Ripple (ODM) and Ripplr (Riak Search). I opened a Rails console and ran my Key Filter based MapReduce. It returned 31,830 rows in a little over a minute on average. I then queried the same bucket using 2i and $key with the same parameters and no Reduce phase. It ran for over 3 minutes before I canceled it! As a sanity check I shrank the range to two months that I knew had much less data ... it returned results over a minute later!

The code for our Key Filter:
The code for the comparable 2i range query:
So my question to all of you is: why is 2i/$key being recommended over Key Filters? What am I missing here?

As a side note ... Riak handled an extraordinary spike in load like a champ, both writing and reading data. Our concerns strictly revolve around how best to query that data now that we have it.
