MongoDB: A Light in the Darkness! (Key Value Stores Part 5)

The universe was dark and chaotic. Bits of broken matter swirled everywhere, illuminated by flashes of explosive light, and the rare gleam of something brighter and more persistent. Those bright lights of persistence always seemed to be shrouded in a miasma of cosmic dust. Then it happened!

A twist of gravimetric interplay pulled two of these lights towards each other, where they swirled and danced for a time prior to crashing into each other. That cosmic convergence showered the surrounding space with illumination as the resulting maelstrom of persistence coalesced towards stability and slashed through the miasma, shining a new light on the cosmos. That new light of persistence was good, and was called MongoDB.

MongoDB can be thought of as the goodness that erupts when a traditional key-value store collides with a relational database management system, mixing their essences into something that’s not quite either, but rather something novel and fascinating.

MongoDB is a document-oriented database. If you haven’t used one before, that may sound strange, but it’s really pretty simple. A document is a set of keys and values that, together, represent a larger set of data. Conceptually, it’s a lot like a table with a free form schema. If you have used Tokyo Cabinet tables, they are functionally similar. It’s a very useful paradigm because it allows you to store and then access your data in a simple, direct, and flexible way.
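To make the idea concrete, here is a small plain-Ruby sketch (the field names are hypothetical, chosen to match the finance examples later in this article). A document is just a hash, and two documents in the same collection need not share a schema:

```ruby
# A MongoDB document maps naturally onto a Ruby hash.
stock = {
  'ticker'    => 'GOOG',
  'name'      => 'Google Inc.',
  'cusip'     => '38259P508',
  'reference' => 'http://www.google.com/finance?q=goog'
}

# Keys are free form; a second document can carry entirely
# different fields and still live in the same collection.
bond = { 'cusip' => '912828C24', 'coupon' => 4.25 }

puts stock['ticker']    # GOOG
puts bond.key?('name')  # false
```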

Installing MongoDB is simple. Just hit http://www.mongodb.org/display/DOCS/Downloads, and download the appropriate package for your platform. Then:

mkdir -p /data/db
tar -xvzf PACKAGE

./mongodb-xxxxxxx/bin/mongod &

At that point, you have a running instance of MongoDB. Now, try a simple interaction with it:

./mongodb-xxxxxxx/bin/mongo

 > db.foo.save( { a : 1 } )
 > db.foo.findOne()

Awesome! You’re off to the races.

MongoDB support is available in many languages, making it a good choice for a system that has to work in a polyglot environment; all of the major languages have support. The Ruby package is a gem known as mongodb-mongo. To install it, first make sure RubyGems knows that gems.github.com is a valid source for gems:

gem sources --list

Add gems.github.com if it isn’t shown in that list:

gem sources --add http://gems.github.com

Then install:

gem install mongodb-mongo

Or, if you want to install the version that uses a C extension for better performance:

gem install mongodb-mongo_ext

Using MongoDB is Simple

>> require 'rubygems'; require 'mongo'
=> true
>> include XGen::Mongo::Driver
=> Object
>> db = Mongo.new.db('finance')
=> #<XGen::Mongo::Driver::DB:0x2a98da7038 @socket=#<TCPSocket:0x2a98da5be8>, @port=27017, 
   @auto_reconnect=nil, @semaphore=#<Object:0x2a98da6ed0 @mu_waiting=[], @mu_locked=false>, @name="finance", 
   @nodes=[["localhost", 27017]], @host="localhost", 
   @strict=nil, @pk_factory=nil, @slave_ok=nil>

>> collection = db.collection('stocks')
=> #<XGen::Mongo::Driver::Collection:0x2a98d94208 @name="stocks", @hint=nil, 
   @db=#<XGen::Mongo::Driver::DB:0x2a98da7038 
   @socket=#<TCPSocket:0x2a98da5be8>, @port=27017, 
   @auto_reconnect=nil, @semaphore=#<Object:0x2a98da6ed0 @mu_waiting=[], @mu_locked=false>, @name="finance", 
   @nodes=[["localhost", 27017]], @host="localhost", 
   @strict=nil, @pk_factory=nil, @slave_ok=nil>>

>> stock = {'ticker' => 'GOOG',
>> 'name' => 'Google Inc.',
>> 'cusip' => '38259P508',
>> 'reference' => 'http://www.google.com/finance?q=goog'}
=> {"reference"=>"http://www.google.com/finance?q=goog", 
    "name"=>"Google Inc.", "cusip"=>"38259P508", "ticker"=>"GOOG"}

>> collection.insert stock
=> {"reference"=>"http://www.google.com/finance?q=goog", 
    "name"=>"Google Inc.", "cusip"=>"38259P508", "ticker"=>"GOOG"}

That’s all there is to it. Just insert your hash representation of your document, and it’ll be stored for you. To retrieve one or more documents, use the #find method:

>> cursor = collection.find('ticker' => 'GOOG')
=> #<XGen::Mongo::Driver::Cursor:0x2a98d28940 @closed=false, 
   @query=#<XGen::Mongo::Driver::Query:0x2a98d28af8 
   @order_by=nil, @fields=nil, @number_to_return=0, 
   @selector={"ticker"=>"GOOG"}, @hint=nil, 
   @number_to_skip=0, @explain=nil>, @rows=nil, @cache=[], 
   @query_run=false, @num_to_return=0, @can_call_to_a=true, 
   @db=#<XGen::Mongo::Driver::DB:0x2a98da7038 @socket=#<TCPSocket:0x2a98da5be8>, @port=27017, 
   @auto_reconnect=nil, @semaphore=#<Object:0x2a98da6ed0 @mu_waiting=[], @mu_locked=false>, @name="finance", 
   @nodes=[["localhost", 27017]], @host="localhost", 
   @strict=nil, @pk_factory=nil, @slave_ok=nil>, 
   @collection=#<XGen::Mongo::Driver::Collection:0x2a98d94208 @name="stocks", @hint=nil, 
   @db=#<XGen::Mongo::Driver::DB:0x2a98da7038 
   @socket=#<TCPSocket:0x2a98da5be8>, @port=27017, 
   @auto_reconnect=nil, @semaphore=#<Object:0x2a98da6ed0 @mu_waiting=[], @mu_locked=false>, @name="finance", 
   @nodes=[["localhost", 27017]], @host="localhost", 
   @strict=nil, @pk_factory=nil, @slave_ok=nil>>>

>> cursor.next_object.inspect
=> "{\"_id\"=>#<xgen::mongo::driver::objectid:0x2a98ce9448 @data= \[74, 184, 252, 71, 34, 116, 195, 23, 83, 115, 44, 164\]>, 
   \"reference\"=>\"http://www.google.com/finance?q=goog\", 
   \"name\"=>\"Google Inc.\", \"cusip\"=>\"38259P508\", 
   \"ticker\"=>\"GOOG\"}"</xgen::mongo::driver::objectid:0x2a98ce9448></object:0x2a98da6ed0></xgen::mongo::driver::collection:0x2a98d94208></object:0x2a98da6ed0></xgen::mongo::driver::db:0x2a98da7038></xgen::mongo::driver::cursor:0x2a98d28940>

As you can see in the example above, #find is simple. It takes a hash describing the keys to search and the values to match against, and it returns a cursor object that can be enumerated to retrieve the results. So, when a find operation matches many documents, you can do something like this:

collection.find('price_date' => '2009-09-21').each do |stock|
  # do stuff with stock
end

If your query should only return a single data item, or you only care about the first of a set of data that might match, you can use #find_first, like this:

>> collection.find_first('ticker' => 'GOOG')
=> {"_id"=>#<xgen::mongo::driver::objectid:0x2a98ccd090 @data= \[74, 184, 252, 71, 34, 116, 195, 23, 83, 115, 44, 164\]>, 
   "reference"=>"http://www.google.com/finance?q=goog", 
   "name"=>"Google Inc.", "cusip"=>"38259P508", "ticker"=>"GOOG"}</xgen::mongo::driver::objectid:0x2a98ccd090>

Notice in the above set of returned data that there is one additional field added to the record. MongoDB reserves all fields that start with the _ character for internal use. The _id field is a unique identifier for that document, and it receives special indexing and treatment from MongoDB in order to make many database operations more efficient.
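That _id is a 12-byte BSON ObjectID, and its leading four bytes are a big-endian Unix timestamp recording when the id was generated. Here is a hedged plain-Ruby sketch of decoding it, using the byte values from the inspect output shown earlier (the decoding is my own illustration, not part of the driver's API):

```ruby
# Byte values copied from the ObjectID inspect output above.
data = [74, 184, 252, 71, 34, 116, 195, 23, 83, 115, 44, 164]

# The first 4 bytes form a big-endian Unix timestamp.
seconds = data[0, 4].inject(0) { |acc, byte| (acc << 8) | byte }

puts seconds               # 1253637191
puts Time.at(seconds).utc  # a moment in late September 2009, when the id was minted
```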

So, if you’re like me, you’re looking at these examples and wondering how you move beyond find_first(FIELD => VALUE), which is obviously limited to searching only for exact matches. MongoDB has you covered:

  • Comparison searches: collection.find({'price' => {'$gt' => 10.00}})
  • Regular expressions: collection.find({'ticker' => /^MS/})
  • Sets: collection.find({'ticker' => {'$in' => ['GOOG','YHOO']}})
  • Sorting and limiting: collection.find({'cusip' => {'$gt' => '580'}}, {:limit => 100, :sort => 'ticker'})

In this way, MongoDB provides much of the query capability of a SQL database.
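One pleasant consequence of this design is that the operators above are just reserved keys inside ordinary Ruby hashes, so query conditions can be built up and combined as plain data before ever touching the database. A small sketch (the variable names are mine, and no server is needed to run it):

```ruby
# Query operators like $gt and $in are just reserved hash keys,
# so selectors compose by nesting and merging plain Ruby hashes.
price_filter  = { 'price'  => { '$gt' => 10.00, '$lt' => 500.00 } }
ticker_filter = { 'ticker' => { '$in' => ['GOOG', 'YHOO'] } }

# Merging produces a selector whose top-level keys must ALL match,
# which is how MongoDB treats multiple conditions in one query hash.
selector = price_filter.merge(ticker_filter)

puts selector['price']['$gt']    # 10.0
puts selector.keys.sort.inspect  # ["price", "ticker"]
```

The resulting hash is exactly what you would hand to collection.find.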

While you can query the document store on any key, if there are keys that you expect to be doing a lot of queries with, you should create an index on that key. Doing so dramatically increases the speed at which the data can be queried, especially when there is a lot of it. To do so:

>> collection.create_index('key')
=> "key_1"

In addition to its key-value-like storage capabilities, MongoDB has one other interesting capability that I want to reveal. It offers a GridFS storage system that lets people store complete files within the database. The Ruby library for Mongo that provides access to this capability is called mongo/gridfs. It essentially permits you to do file IO into and out of a MongoDB database.

>> require 'rubygems'; require 'mongo'; require 'mongo/gridfs'
=> true
>>  include XGen::Mongo::Driver
=> Object
>> include XGen::Mongo::GridFS
=> Object
>> db = Mongo.new.db('finance')
=> #<XGen::Mongo::Driver::DB:0x2a98d41800 @auto_reconnect=nil, @host="localhost", 
   @semaphore=#<Object:0x2a98d416c0 @mu_waiting=[], 
   @mu_locked=false>, @name="finance", 
   @nodes=[["localhost", 27017]], @strict=nil, 
   @pk_factory=nil, @slave_ok=nil, 
   @socket=#<TCPSocket:0x2a98d40928>, @port=27017>

>> GridStore.open(db,'testfile','w+') {|fh| fh.puts "This is a test."}
=> nil

>> GridStore.open(db,'testfile','r') {|fh| puts fh.read}
This is a test.
=> nil

As you can see, MongoDB is very easy to use. It is not a screaming speed demon like a simple key-value store (such as Redis or Tokyo Cabinet), but it performs more than adequately. On commodity Linux hardware, my tests showed about 2,500 simple document insertions per second, and about 2,800 reads per second, using the gem without the C extension.

MongoDB does not yet have production-quality sharding capabilities, but there is now alpha-level support for automatic sharding, so it’s only a matter of time before MongoDB enters the realm of being a fully production-ready, horizontally scalable key-value document store.

MongoDB’s charm is that it mixes a very powerful, expansive query model with a free-form key-value-like data store, while still giving adequate performance. It is ideal for storing documents in a database. Query syntax isn’t the prettiest thing around, but with an ease of use that rivals that of Redis, MongoDB should be a strong contender if you have complex data storage needs.