We had been using 2i for most of our querying, with a combination of "Data Point Objects" and MapReduce for our more analytical needs. However, when our MapReduce jobs started bombing, we reviewed our querying approach. (For the record, they were bombing with a "preflist exhausted" error, now resolved; more on this in a subsequent post.)
It didn't take long to find posts like this on Google:
Be aware that key filters are just a thin layer on top of full-bucket key listings. You'd be better off storing the field you want to filter in a secondary index, which more efficiently supports range queries (note that only the LevelDB storage engine currently supports secondary indexes). Barring that, you could use the special "$key" index for a range query on the key. - Sean Cribbs (http://riak-users.197444.n3.nabble.com/preflist-exhausted-error-td3935922.html)

Key Filters are the backbone of our querying approach. We were (and are) under the impression that this is the best way to get a subset of the data into MapReduce. Since most of our queries are over date ranges, all of our keys are prefixed with yyyymmddhhmmss, which allows us to use date-based Key Filters in our MapReduce. According to the post, however, we should be using 2i and the special $key index instead for performance reasons.
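The whole scheme rests on one property: because the keys are zero-padded, fixed-width yyyymmddhhmmss prefixes, lexicographic key order matches chronological order, so a date range can be expressed as a plain key range. A quick pure-Ruby check of that property (the key suffixes here are made up for illustration):

```ruby
# Keys prefixed with yyyymmddhhmmss sort lexicographically in the
# same order as chronologically, because every date field is
# zero-padded and fixed-width. Suffixes are hypothetical.
keys = %w[
  20111231235959-order-42
  20120101000000-order-7
  20120215093000-order-13
  20120301000000-order-99
]

# Lexicographic sort leaves the chronological order untouched,
# which is what lets both Key Filters and $key range queries
# select a date range by comparing raw keys.
raise 'keys are not chronologically sortable' unless keys.sort == keys

puts 'lexicographic order == chronological order'
```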
First of all, what is $key? Here is what the Riak Handbook has to say about the $key field:
So I put it to the test. For the record, we are using Ruby on Rails with Ripple (ODM) and Ripplr (Riak Search). I opened a Rails console and ran my Key Filter based MapReduce. It returned 31,830 rows in a little over a minute on average. I then queried the same bucket using 2i and $key with the same parameters and no Reduce phase. It ran for over 3 minutes before I canceled it! As a sanity check I shrank the range to two months that I knew had much less data ... it still returned results over a minute later!
The code for our Key Filter:
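As a sketch (the bucket name and date range below are assumptions, not our production values), a key-filter MapReduce job over a yyyymmddhhmmss range looks like this on Riak's HTTP /mapred interface; the riak-client gem builds the same structure for you via Riak::MapReduce#filter:

```ruby
require 'json'

# Illustrative only: bucket name and date range are assumptions.
from = '20120101000000'
to   = '20120301000000'

# A key-filter MapReduce job as POSTed to Riak's /mapred endpoint.
# The "between" filter is applied to every key in the bucket --
# which is why key filters still pay for a full-bucket key listing.
job = {
  'inputs' => {
    'bucket'      => 'events',
    'key_filters' => [['between', from, to]]
  },
  'query' => [
    { 'map' => { 'language' => 'javascript',
                 'name'     => 'Riak.mapValuesJson',
                 'keep'     => true } }
  ]
}

puts JSON.pretty_generate(job)
```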
The code for the comparable 2i range query:
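Again as a sketch with the same made-up names: the comparable query hits the 2i endpoint with $key as the index name, asking Riak (LevelDB backend required) for the key range directly instead of filtering a full key listing:

```ruby
# Illustrative only: bucket name and date range are assumptions.
from = '20120101000000'
to   = '20120301000000'

# 2i range query over the special $key index -- the keys
# themselves are the indexed values, so no extra index entries
# need to be written. Requires the LevelDB backend.
path = "/buckets/events/index/$key/#{from}/#{to}"

# With the riak-client gem the equivalent calls would be roughly:
#   keys = client.bucket('events').get_index('$key', from..to)
# or, feeding the matching keys straight into a MapReduce job:
#   Riak::MapReduce.new(client).index('events', '$key', from..to)

puts path
```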
So my question to all of you is: why is 2i/$key being recommended over Key Filters? What am I missing here?
As a side note, Riak handled an extraordinary spike in load like a champ, both writing and reading data. Our concerns revolve strictly around how best to query that data now that we have it.