Key-Value Stores in Ruby (Key-Value Stores Part 1)

Key-value stores have emerged in the last few years as an alternative to relational database management systems for some data storage tasks. SimpleDB, BigTable, and Tokyo are just three of the variants. So what are they? How are they different than relational databases? And when should you use them?

Applications, whether web apps, simple dynamic websites or command line apps, frequently need some sort of persistent data store. As a result, databases have become ubiquitous on modern systems, and because of this chicken and egg relationship, programmers will often habitually reach for a relational database when the project only calls for a way to persist data.

Generally, this is fine. Relational databases are a well known technology, with numerous features that make life easy for the people who use them. And while you can quibble about the nuances, relational databases typically do a decent job with most data storage and data query tasks.

Once in a while it behooves a programmer to reach for a different tool. Perhaps you need something that can query simple data faster than a relational system can. Maybe there are scaling requirements that are difficult to meet with a relational system. Or you want to avoid tying an application to a requirement for a relational system that may well have its own maintenance needs over time. Or maybe you just want something simpler than a relational database. In all of these cases, a key-value store may be just the tool for you.

In the simplest terms, a key-value store is exactly what the name suggests: it’s a storage system that stores values, indexed by a key. The implementations are diverse, but they’re usually built on either a hash data structure, or a tree data structure, like a b-tree. Hash based key-value stores have fast lookup and update speeds, and require that keys be unique. Lookup in a hash based tree is limited to key-only lookup.

Tree based stores generally allow multiple identical keys, and because tree based structures (like the b-tree) are ordered, they allow you to query individual keys, as well as ranges of keys. The drawback to tree based stores is that those structures are generally slower than simple hashes. For example:

A log aggregation system that stores messages keyed by timestamp, but which uses a hash based store, will be troublesome to query in a useful way, despite the speed of lookup, because you can only lookup specific timestamps. A system that uses a b-tree based store may be marginally slower for individual queries, but by permitting lookup based on a range of timestamps, can easily query the records of interest. This more than offsets the difference in basic query speeds.

Ruby comes with a simple hash based key-value store called PStore. If you’ve never used it, it’s definitely worth learning. For simple data storage, PStore can be very useful.

Let’s walk through a simple example: assume you get a daily feed of stock data, and want to load it into a simple PStore database so that the data can be displayed on a web site. A loader might look like this:

stock_loader.rb

require 'pstore'
require 'csv'
require 'date'

store = PStore.new('fund_data')
funds = {}

store.transaction do
  store.roots.each { |root| funds \[root\] = store.root }

  CSV.read(ARGV.first).reject {|x| x.first.nil? }.each do |row|
    row \[0\] = row \[0\].to_s.strip

    record = {
      :ticker => row \[0\],
      :rate_date => Date.parse(row \[2\]),
      :price => row \[3\],
      :price_change => row \[4\],
      :percent_change => row \[5\]
    }
    store[row \[0\]] = record
  end
end

A PStore data store acts just like a hash, with the exception that everything done with the store has to be done inside the context of a transaction. Creating a new store is simple:

store = PStore.new(FILENAME)

With that one line, a key-value store with hash-like behavior (which stores its data on the filesystem in a file named FILENAME) is created. Accessing a PStore key-value store is conceptually very similar to accessing a Hash:

store['footloose'] = 'dance like it's the 1980s!'
hash['footloose'] = 'dance like it's the 1980s!'

Both lines of code have similar behaviors; they both store a value that is indexed by a key. Likewise:

line = store['footloose']
line = hash['footloose']

In both cases, a value is queried from the data store based on the key (footloose) that was provided. The most significant difference between a Hash and a PStore is seen a couple lines farther down in the code.

store.transaction do

PStore requires that all interaction with the store happens within the context of a transaction, while regular Hash interactions have no such restriction. This guarantees, in a simple manner, that two separate processes that are both accessing the same store will not have collisions with any modifications. Only a single transaction can be active at any one time, and all changes to the store are made persistent only when the transaction is closed (or committed). Changes are discarded if the transaction is rolled back. It’s simple, with convenient, hash-like access, and just a few useful database-like features.

Once a loader similar to the one above runs, the key-value store will be populated with data, and your website can pull this data just as trivially as it was stored: /app/controllers/fund_table_controller.rb

class FundTableController < ActionController::Base
  before_filter :setup
  def index
  end

  private
  def setup
    @store = Pstore.new('fund_data')
  end
end

/app/views/helpers/fund_table_helper.rb

module FundTableHelper
  def data
    @store.transaction { @store \[@ticker\] }
  end
  def tickers
    @store.transaction { @store.roots.sort }
  end
end

/app/views/fund_table/index.html.haml

%table
  %thead
    %tr
      %th ticker
      %th price
      %th change
  %tbody
    - tickers.each do |@ticker|
      %tr
        %td= data.ticker
        %td= data.price
        %td= data.price_change

The web application connects to the store and then opens transactions and reads the desired data - it’s all very simple.

While PStore lets you store anything that can be marshaled, it does have some notable performance restrictions. It’s a simple piece of kit, so when a transaction is opened on a store, it doesn’t do anything sophisticated; it reads the whole store into memory. This makes it useless for huge data sets that you would not or could not otherwise store in RAM, and it also makes opening and closing any large data store an expensive operation. Whether the expense is greater than that incurred by a traditional relational database will depend on the individual use cases, but for larger data sets it becomes completely untenable as a solution. In those cases, you need to look a relational database or more sophisticated key-value stores.

Syntax for different key-value stores will obviously differ, but they all function in a manner conceptually similar to PStore. They all allow storage of some arbitrary data, indexed for retrieval by a single key. For data which is amenable to this sort of storage and retrieval pattern (and a lot of data used in web applications and dynamic web sites is), key-value stores will generally be much faster than relational databases and can keep your code looking clean and simple.

In my next post I’ll introduce Tokyo Cabinet, which is one of the more popular options in a new crop of key-value stores. Tokyo Cabinet is fast, handles large data sets well, and is available with add-ons to let you interact with it over the network, like a traditional relational database, and to use it for full text indexing. Also, Tokyo Cabinet has bindings for several languages other than Ruby, which makes practical as a data store in a polyglot environment.

Subscribe to this blog or follow @engineyard on Twitter to be notified of the next key-value store post.