Wednesday, January 16, 2013

Riak: What's faster? 2i or Key Filters

Recently we've experienced a large surge in the size of our data, which has called into question some of our querying approaches and node configuration.

We had been using 2i for most of our querying and a combination of "Data Point Objects" and MapReduce for our more analytical needs. However, when our MapReduce started bombing, we questioned and reviewed our querying approach. (For the record, it was bombing due to preflist exhaustion, which has since been resolved; more on this in a subsequent post.)

It didn't take long to find posts like this on Google:
Be aware that key filters are just a thin layer on top of full-bucket key listings. You'd be better off storing the field you want to filter in a secondary index, which more efficiently supports range queries (note that only the LevelDB storage engine currently supports secondary indexes). Barring that, you could use the special "$key" index for a range query on the key. - Sean Cribbs
Key Filters are the backbone of our querying approach. We were/are under the impression this is the best way to get a subset of the data for MapReduce. Since most of our queries are over date ranges, all of our keys are prefixed with yyyymmddhhmmss. This allows us to use date-based Key Filters in our MapReduce. According to the post, however, we should be using 2i and the special $key field for performance reasons.

First of all, what is $key? Here is what the Riak Handbook has to say about the $key field:
There are special field names at your disposal too, namely the field $key, which automatically indexes the key of the Riak object. Saves you the trouble of specifying it twice. Riak automatically indexes the key as a binary field for your convenience ...
So I put it to the test. For the record, we are using Ruby on Rails with Ripple (ODM) and Ripplr (Riak Search). I opened a Rails console and ran my Key Filter based MapReduce. It returned 31,830 rows in a little over a minute on average. I then queried the same bucket using 2i and $key with the same parameters and no Reduce phase. It ran for over 3 minutes before I canceled it! As a sanity check, I shrank the range to two months that I knew had much less data ... it returned results over a minute later!

The code for our Key Filter:
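The original snippet isn't reproduced here, but the idea can be sketched as follows. This is a minimal reconstruction, not our production code: the helper name and date range are placeholders, and it builds the raw key-filter spec that gets sent along with a MapReduce job.

```ruby
require "time"

# Key Filters operate on the key string itself. Since our keys start with
# yyyymmddhhmmss, a date range becomes a lexicographic "between" filter
# on the key. This builds the filter spec sent with the MapReduce job.
def date_range_key_filter(from_time, to_time)
  [["between",
    from_time.strftime("%Y%m%d%H%M%S"),
    to_time.strftime("%Y%m%d%H%M%S")]]
end

filter = date_range_key_filter(Time.utc(2012, 11, 1), Time.utc(2013, 1, 1))
# filter == [["between", "20121101000000", "20130101000000"]]
```

With the riak-client gem, the same filter can be expressed through the MapReduce DSL, e.g. `mr.filter("events") { between "20121101000000", "20130101000000" }` (bucket name hypothetical), before adding the map phase.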
The code for the comparable 2i range query:
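Again as a hedged sketch rather than the original code, the comparable 2i query expresses the same date range as a range on the special $key index (helper name and bucket name below are placeholders):

```ruby
require "time"

# The same date range expressed as a 2i range query on the special $key
# index, which indexes each object's key as a binary field.
def key_index_range(from_time, to_time)
  from_time.strftime("%Y%m%d%H%M%S")..to_time.strftime("%Y%m%d%H%M%S")
end

range = key_index_range(Time.utc(2012, 11, 1), Time.utc(2013, 1, 1))
# range.first == "20121101000000"; range.last == "20130101000000"
```

With riak-client, the keys come back via something like `client.bucket("events").get_index("$key", range)`, which can then feed a MapReduce job with no key-filter phase.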
So my question to all of you: why is 2i/$key being recommended over Key Filters? What am I missing here?

As a side note ... Riak handled an extraordinary spike in load like a champ, both writing and reading data. Our concerns strictly revolve around how best to query that data now that we have it.

Monday, January 14, 2013

Why CodeMash continues to be the awesome!

What is CodeMash? 

"CodeMash is a unique event that will educate developers on current practices, methodologies, and technology trends in a variety of platforms and development languages such as Java, .Net, Ruby, Python and PHP."

This was CodeMash's 7th year and my 4th year in attendance. While the CodeMash description is accurate, what it fails to describe is the CodeMash culture. Education does not just happen during speakers' sessions or precompilers. It also occurs while sitting down and having a brew with a signatory of the Agile Manifesto, or while having a casual conversation about Rails 4.0 with a core contributor. It even occurs while sitting in the hot tub/swim-up bar with other agile coaches talking about real-world scenarios. There are no rules at CodeMash, except one ... be yourself. Oh, and maybe lots of crispy bacon.

2013 tech trends at CodeMash

I tend to be a Ruby centric developer these days, but these were the buzzwords I continued to hear while at CodeMash.
  • JavaScript
  • Gamification
  • Single Page Applications
  • Bacon Bar
  • and more JavaScript


It was extremely difficult to find a talk that did not mention JavaScript. Several frameworks and tools have grown up around JavaScript, including AngularJS, Backbone, Node, and Knockout. Testing for JavaScript has also matured significantly with tools like Jasmine and Lineman (see my notes on the Test Double talk).
On a similar note, CoffeeScript was the language of choice for crafting JavaScript, as it compiles down to (and promotes) best-practice JavaScript code that is 99.9% guaranteed to work in IE! And you don't need to be working in Rails to use CoffeeScript. You can use the coffee command-line tool to watch a directory and (re)compile your CoffeeScript as you make changes. Keep reading for more on testing in JavaScript!

Gamification and Single Page Applications

Dennis and Brian from SRT Solutions have crafted an application for exploring different ways to write single page applications. If you have ever read a Choose Your Own Adventure book, you're going to love their application, titled Choose Your Own Application. The focus is on building your own single page application with your choice of technologies. The application has been "gamified", and "players" earn badges for each choice they make. This is a great opportunity to explore a new technology in a fun way. Technology choices include Backbone, Knockout.js, .NET, Rails, Node.js, Heroku, CoffeeScript, and Azure.
Rails and Single Page Applications ... with the release of Rails 4.0, Rails will add default support for single page applications via Turbolinks. Currently it can be disabled by removing the turbolinks gem from the Gemfile; otherwise you will need to disable it on a per-link basis. DHH has stated he intends to drive Rails in the direction that is best for Basecamp, a single page application, so expect more changes like this in the near future. My $.02: expect a community fork of Rails in the near future.
Brian Prince delivered an excellent talk on Gamification. He discussed several real-world examples where Gamification has led to modified user behavior, including applications that encourage diabetics to test their blood sugar regularly and elderly people living at home alone to stay active and engaged. The important thing to remember is to identify the behavior you want to change and then gamify that aspect of your application to encourage it. Adding badges for the sake of adding badges often encourages the wrong behavior.

And Gamification does not just mean badges. Take the bottomless trash can, for example. It changes behavior by encouraging people to put their trash into a trash can: when they do, it sounds like their garbage is falling into a deep chasm. It's fun, and it gets people to do it again. People were actually found hunting for nearby litter just to throw it in!

Machine Learning

Seth Juarez delivered two excellent presentations on Machine Learning. Machine learning allows us to find and exploit patterns in data. There are two main classifications of machine learning: supervised and unsupervised. Supervised learning allows us to be predictive, while unsupervised learning helps us to understand the structure of the data. For more details, read my notes from Seth's talks (part 1, part 2).
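To make the supervised side of that distinction concrete, here is a toy sketch (my own illustration, nothing to do with Seth's library): supervised learning boiled down to a one-nearest-neighbor classifier, where labeled training pairs let us predict a label for a new point.

```ruby
# Toy supervised learning: 1-nearest-neighbor. Given [features, label]
# training pairs, predict the label of the closest training point.
def nearest_neighbor(training, point)
  training.min_by { |features, _label|
    # squared Euclidean distance from the training point to the query
    features.zip(point).sum { |a, b| (a - b)**2 }
  }.last
end

training = [
  [[1.0, 1.0], :small],
  [[9.0, 9.0], :large],
]
nearest_neighbor(training, [2.0, 1.5])  # => :small
```

An unsupervised method, by contrast, would be handed the feature vectors with no labels at all and asked to discover groupings such as :small and :large on its own.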
Seth also has a NuGet package that can be imported into Visual Studio. It is called NuML and can be found here. It was demoed during his talk and looks awesome! As the number of Big Data projects grows, this is going to become an increasingly common topic of discussion and application.

Real world Javascript testing

JavaScript testing has really improved since I last looked into it. Jasmine appears to be the front-runner and, from what I saw and experienced, is my preferred choice. It looks a lot like RSpec and can use the rspec-given syntax thanks to Justin Searls and jasmine-given. Justin demonstrated a combination of tools that makes testing JavaScript extremely easy. Lineman is one of those tools; it requires Node.js and NPM to install and is used to run your Jasmine specs. You can read more about JavaScript testing in my detailed notes on his talk.

Better Metrics for your team

Nayan Hajratwala gave a fantastic demonstration on measuring your team's effectiveness. Traditionally, teams have been measured by cyclomatic complexity, velocity, hours in the office, etc. However, none of those answers what the customer really wants to know ... what is the team's throughput?
Throughput is the rate at which features are passing through the system. Most often teams try to deliver more by putting more work in progress into the system. This often results in lower quality, bottlenecks, and overall lower throughput.
Cycle Time is the time between two successfully delivered features and can be computed using Little's Law, which states:
The average number of work items in a stable system is equal to
their average completion rate, multiplied by their average time in the
system.
To demonstrate, Nayan created 4 teams and had each play "The Dot Game". The game divides the team into 8 roles, and the team measures how fast it can assemble the "product". The demonstration showed that adding more work in progress only resulted in less being delivered. Nayan then changed the rules so that there was less work in progress, requiring each role to work on only one product at a time, and repeated the exercise. Each of the 4 teams saw an average 8x improvement in Cycle Time, a huge improvement in quality, and an increase in the amount of product produced.
The goal should not be 100% utilization of the workforce; it should be maximizing throughput. This demonstration showed that minimizing work in progress and having each role focus on one thing at a time resulted in less than 100% utilization, but also in much higher throughput and higher quality.

Bacon Bar!

Several stations were assembled, each with its own mouth-watering trays of bacon and selection of toppings. 350 lbs of bacon were consumed in a very short amount of time, and no heart attacks were reported. Thanks to Josh Walsh and Designing Interactive for coming up with this great idea and sponsoring the activity!!

I was, however, surprised that Duct Tape beat Bacon 34-29 in the first round of Manifest's MashMadness. Duct Tape even went on to beat Gandalf the Grey in the championship round. Gonzaga??