Using PostgreSQL hstore in a Rails application on Engine Yard Cloud
This article is a medium-depth introduction on how to use the PostgreSQL hstore extension in a Rails application, and how to get that application deployed and serving queries on Engine Yard Cloud. The content here is aimed at intermediate to advanced developers. You should have some knowledge of ActiveRecord, Rails, Ruby and SQL, as well as some familiarity with unstructured data, and why it’s useful.
What is hstore?
PostgreSQL provides an extension called “hstore”, which is a form of key/value storage system for unstructured or semi-structured data in a column in your database. Generally speaking, if you’ve used serialization in Rails to store things before, you can do the same thing with more efficiency using the hstore extension.
You can do more than just store unstructured data, however. You can query the data inside hstore fields without needing a complete map/reduce implementation (think various NoSQL databases). This is very useful for many things – for example, with a simple query, you could quickly find out which items most often appear in the semi-structured fields.
Bear in mind that hstore isn’t a replacement for a NoSQL database. Proper use of hstore can help prevent an application from breaking and allow developers greater flexibility as a project ages, but only when used judiciously with a full understanding of how the mechanism works, including the indexes that will need to be built in order to query the data.
Potential Applications
What should you use hstore for? As I see it, hstore is suitable for use in cases where you have most of your data normalized, but you may have bits of data related to your models may need to change frequently and you don’t want to run database migrations all that often, thus changing the structure of your database and possibly requiring your site to come down.
For example, maybe you’re consuming information from an API that may change its responses without providing a backwards-compatible, unchanging structure (inserting new keys in its JSON responses, perhaps). This may not be all that common with public APIs, but with internal projects this can happen rather frequently. Perhaps you wouldn’t want to discard this data, but without writing a migration and making room for new data, you won’t be able to keep it.
Another possibility could be user preferences – you don’t necessarily want 10-30+ rows per user in a giant join table for user preference data and perhaps you feel that should be stored with the user object, but don’t want to go through the trouble of migrating your database (and incurring possible downtime) when deploying changes to user preference structure.
Enter hstore.
With this extension active and in use, you could simply define changes to the structure without a migration and redeploy to pick up changes immediately. That API could throw whatever it wants at you because you can just take the JSON and shove it into hstore. And that myriad of user preferences? No big deal – just store the stuff that differs from the defaults as a key/value pair and you’re done. Change it all you like by just committing new code.
Getting Started
For the purposes of this article, we’re going to limit our scope to just PostgreSQL 9.2 and Rails 3.2.x (though Rails 4 will reportedly have hstore support). Depending on your platform and how you installed PostgreSQL, you may need to take additional steps to get hstore working on your local dev machine. (We’ll cover how to get hstore working on Engine Yard Cloud later in this article.) The general idea is to make sure you have the hstore PostgreSQL extension available to your installation and then to create the extension on the database for your application.
For those of you on OS X who installed PostgreSQL via homebrew, all you need to do is create a database, login and CREATE EXTENSION HSTORE.
Let’s take this from the top. Assuming you’ve installed PostgreSQL and have a basic database cluster running, we can login and create the database:
psql -h localhost -d postgres
Type "help" for help.
postgres=# CREATE DATABASE rails_hstore_dev OWNER dev;
CREATE DATABASE
postgres=# CREATE DATABASE rails_hstore_test OWNER dev;
CREATE DATABASE
Here I’ve logged into the default “postgres” database as my existing user on my development computer, which happens to be the superuser for this database (default setup with homebrew if memory serves). From there I’ve told PostgreSQL to create two databases: rails_hstore_dev and rails_hstore_test, both owned by the user “dev”, which I set up previously.
TIP: You can see which databases exist on the machine by issuing \l (lower case L) at the psql prompt.
Next, connect to the newly created databases and create the extension:
rails_hstore_dev=# \c rails_hstore_dev
You are now connected to database "rails_hstore_test" as user "jaustinhughey".
rails_hstore_test=# CREATE EXTENSION HSTORE;
CREATE EXTENSION
The \c command, short for “connect”, is how you can pseudo-switch between databases in psql. Here I’ve connected to the development database and created the hstore extension. Repeat for the _test database as well, then verify that you can query based on hstore datatypes:
rails_hstore_test=# SELECT 'a=>1, mykey => myvalue
hstore
------------------------------
"a"=>"1", "mykey"=>"myvalue"
(1 row)
Here you can see that I’ve made a simple dummy selection and specified the datatype as hstore. This shows us that PostgreSQL is ready to work with hstore data.
Rails integration and the activerecord-postgres-hstore gem
Let’s switch gears and look at how to integrate this with Rails. While database agnosticism is something of a pipe dream with this specific feature, since it’s unique to PostgreSQL, we can still keep things clean and consistent by writing proper migrations to enable the extension prior to running other migrations and/or seeding data.
To accomplish this, we’re going to make use of the activerecord-postgres-hstore gem:
https://github.com/engageis/activerecord-postgres-hstore
http://engageis.github.io/activerecord-postgres-hstore/
This particular gem adds the ability to work with hstore fields in your models and throughout your Rails application. I suggest looking at the readme for this project as it’s very informative and warns you about a few potential curve balls you may hit.
Keep in mind: you’ll need superuser privileges to issue a CREATE EXTENSION command successfully. In production you’ll likely have your database accessible only from within your security group in EC2 (that’s our default on Engine Yard Cloud, anyway) so multiple layers of user access are generally not necessary since all access is network-controlled. If you absolutely must maintain a non-administrative user for your application’s database, you can run psql commands against it directly as a superuser after the database has been created to set up hstore, and then proceed with migrations.
In the case that you want to set up an existing user as a superuser, first use psql to login as an existing superuser:
psql -d postgres -U <superuser> -h <host>
Then use ALTER USER to turn your project’s user into a superuser:
ALTER USER <application username> WITH SUPERUSER;
Then exit the psql console with ‘\q’. Now that user – whomever you specified as <application username> – should be able to CREATE EXTENSION via migrations without trouble.
Next you’ll want to create a migration to enable the hstore extension. To do that you’ll first need to add activerecord-postgres-hstore to your Gemfile and bundle install. With that done, you can generate the migration:
bundle exec rails g hstore:setup
With that step done, take a quick look at the migration so you know what it’s doing:
class SetupHstore < ActiveRecord::Migration
def self.up
execute "CREATE EXTENSION IF NOT EXISTS hstore"
end
def self.down
execute "DROP EXTENSION IF EXISTS hstore"
end
end
In this case the migration is just running a raw statement that’s applicable only to PostgreSQL. Next, just run it: bundle exec rake db:migrate
. When it’s finished you can proceed with building your application.
An Example Rails Application
The rest of this article is going to go through a very basic use case by building a simplified Rails application that makes use of this feature.
Start by setting up a local PostgreSQL installation and initializing a database, setting up your Gemfile to have access to the activerecord-postgres-hstore gem, and generating the extension migration as above.
This is going to be a very simple app that represents one possible implementation of setting and examining user preferences. We won’t be worrying about actual authentication, authorization or identity here just to keep things focused. Instead we’ll just create a basic User model and allow for preferences to be changed on a per-user basis. In reality you’d want this behind admin-level authentication/authorization, but for the purposes of a demo this will work fine.
The application I’m talking about here can be seen on GitHub as an example:
https://github.com/jaustinhughey/rails-hstore-example
Let’s run through the basics really fast:
mkdir rails_hstore
cd rails_hstore
rails new .
Now modify your Gemfile to look like this:
source 'https://rubygems.org'
gem 'rails', '~> 3.2.13'
gem 'pg'
gem 'strong_parameters'
gem 'activerecord-postgres-hstore' # for hstore
group :development, :test do
gem 'rspec-rails'
gem 'factory_girl_rails'
gem 'better_errors'
gem 'binding_of_caller'
platforms :mri, :rbx do
gem "pry"
end
end
# Application server
gem 'puma'
# Gems used only for assets and not required
# in production environments by default.
group :assets do
gem 'sass-rails', '~> 3.2.3'
gem 'coffee-rails', '~> 3.2.1'
# See https://github.com/sstephenson/execjs#readme for more supported runtimes
# gem 'therubyracer', :platforms => :ruby
gem 'uglifier', '>= 1.0.3'
end
gem 'jquery-rails'
And finally bundle install
.
A quick explanation of some of the gems in use:
- pg is the PostgreSQL gem
- strong_paremeters is a prelude to Rails 4’s mass assignment protection in the controller, available in Rails 3.2.x through this gem
- activerecord-postgres-hstore is what allows us to utilize the hstore extensions in the application
- rspec-rails and factory_girl_rails are for tests
- better_errors is a nice gem to have in development for when you run into errors while running the application
- binding_of_caller can be used by better_errors to give you a REPL right in the browser when errors are encountered so I generally include it with better_errors
- On Matz’ Ruby (MRI) and Rubinius (RBX) I’m using the “pry” gem for debugging. It’s a more powerful debugger that can let you “cd” into objects and “ls” to see what’s going on, etc. If you haven’t used it before I highly recommend trying it out.
- puma is a concurrent web server that we’ll use in production. This could also be Unicorn as well – whatever you like. Puma happens to be what I prefer at the moment.
- Note: if you deploy this to Engine Yard Cloud, be sure switch over to “unicorn” if you intend to deploy on Unicorn instead of Rubinius/Puma.
With all these gems installed you’re ready to generate your hstore migration:
bundle exec rails g hstore:setup; bundle exec rake db:migrate
Now that hstore is enabled and ready to go, generate a basic User model:
bundle exec rails g model user
Edit the migration:
class CreateUsers < ActiveRecord::Migration
def change
create_table :users do |t|
t.string :email
t.hstore :preferences
t.timestamps
end
end
end
And migrate your database to create the users table.
Next, we’re going to have to tell ActiveRecord to serialize the preferences column with hstore. Open up app/models/user.rb and add a little code:
class User < ActiveRecord::Base # Use the activerecord-postgres-hstore serializer serialize :preferences, ActiveRecord::Coders::Hstore # Require an email address at the bare minimum. Preferences can be blank. validates :email, :presence => true, :format => { :with => /\A([^@\s]+)@((?:[-a-z0-9]+.)+[a-z]{2,})\Z/ } end
Now that this is finished, let’s just store a little information in there to test it out. Boot up a Rails console and try creating and saving a few objects just to get a feel for things:
$ bundle exec rails c
Loading development environment (Rails 3.2.13)
rubinius-2.0.0.rc1 :001 > u = User.new(email: "[email protected]")
=> #<User id: nil, email: "[email protected]", preferences: {}, created_at: nil, updated_at: nil>
rubinius-2.0.0.rc1 :002 > u.valid?
=> true
rubinius-2.0.0.rc1 :003 > u.preferences = { theme: "black", language: "US English", currency: "USD" }
=> {:theme=>"black", :language=>"US English", :currency=>"USD"}
rubinius-2.0.0.rc1 :004 > u.save
(0.6ms) BEGIN
SQL (21.6ms) INSERT INTO "users" ("created_at", "email", "preferences", "updated_at") VALUES ($1, $2, $3, $4) RETURNING "id" [["created_at", Thu, 18 Apr 2013 21:53:36 UTC +00:00], ["email", "[email protected]"], ["preferences", "\"theme\"=>\"black\",\"language\"=>\"US English\",\"currency\"=>\"USD\""], ["updated_at", Thu, 18 Apr 2013 21:53:36 UTC +00:00]]
(2.2ms) COMMIT
=> true
rubinius-2.0.0.rc1 :005 >
As you can see, we’ve created a simple user with the email address [email protected] and some basic theoretical site preferences (theme, language and currency).
At this point you’re probably asking, “so what’s the big deal?” and rightfully so. Here’s the big deal: you can query against this data multiple ways very efficiently.
rubinius-2.0.0.rc1 :011 > @themes = ["black", "white", "red", "blue"]
=> ["black", "white", "red", "blue"]
rubinius-2.0.0.rc1 :012 > @languages = ["US English", "UK English", "Japanese", "Spanish", "German", "Italian", "Portugese"]
=> ["US English", "UK English", "Japanese", "Spanish", "German", "Italian", "Portugese"]
rubinius-2.0.0.rc1 :013 > @currencies = ["USD", "GBP", "EUR", "AUD", "CDN", "JPY"]
=> ["USD", "GBP", "EUR", "AUD", "CDN", "JPY"]
rubinius-2.0.0.rc1 :014 > 100.times { |n| User.create(email: "user#{n}@example.com", preferences: { theme: @themes.sample, language: @languages.sample, currency: @currencies.sample }) }
(0.4ms) BEGIN
SQL (0.9ms) INSERT INTO "users" ("created_at", "email", "preferences", "updated_at") VALUES ($1, $2, $3, $4) RETURNING "id" [["created_at", Thu, 18 Apr 2013 22:14:01 UTC +00:00], ["email", "[email protected]"], ["preferences", "\"theme\"=>\"red\",\"language\"=>\"Italian\",\"currency\"=>\"GBP\""], ["updated_at", Thu, 18 Apr 2013 22:14:01 UTC +00:00]]
(0.9ms) COMMIT
...
(0.1ms) BEGIN
SQL (0.4ms) INSERT INTO "users" ("created_at", "email", "preferences", "updated_at") VALUES ($1, $2, $3, $4) RETURNING "id" [["created_at", Thu, 18 Apr 2013 22:14:01 UTC +00:00], ["email", "[email protected]"], ["preferences", "\"theme\"=>\"blue\",\"language\"=>\"Italian\",\"currency\"=>\"JPY\""], ["updated_at", Thu, 18 Apr 2013 22:14:01 UTC +00:00]]
(1.1ms) COMMIT
Here I’ve created 100 independent users in a loop storing a random selection of user preferences using Array#sample. Now I’ll see which preferences are most popular.
@themes.each { |t| puts "Users with theme #{t}: " + User.where("preferences @> 'theme=>#{t}'").count.to_s }
(0.4ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'theme=>black')
Users with theme black: 30
(0.3ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'theme=>white')
Users with theme white: 23
(0.2ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'theme=>red')
Users with theme red: 29
(0.3ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'theme=>blue')
Users with theme blue: 18
So here we can see that the theme “black” is most common among the data set just by executing a simple query. We could repeat this for language:
@languages.each { |l| puts "Users with language #{l}: " + User.where("preferences @> 'language=>\"#{l}\"'").count.to_s }
(0.6ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'language=>"US English"')
Users with language US English: 14
(0.3ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'language=>"UK English"')
Users with language UK English: 17
(0.3ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'language=>"Japanese"')
Users with language Japanese: 11
(0.2ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'language=>"Spanish"')
Users with language Spanish: 14
(0.2ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'language=>"German"')
Users with language German: 16
(0.3ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'language=>"Italian"')
Users with language Italian: 15
(0.3ms) SELECT COUNT(*) FROM "users" WHERE (preferences @> 'language=>"Portugese"')
Users with language Portugese: 13
We can see here that UK English was the most popular with this random data set through a simple query to PostgreSQL that would otherwise be pretty rough to churn through without hstore.
Note also the syntax of both queries I’ve shown here. In the first example, none of my objects in the @themes array had any spaces in them – they were all just one-word strings. However the second example did have spaces, so I had to escape quotes to get it to query correctly. That syntax looks like this:
SELECT COUNT(*) FROM users WHERE (preferences @> ‘language=>”Stuff with spaces goes in double quotes *inside* single quotes! Otherwise, omit the double quotes.”’);
Let’s talk about indexes.
Now that we’re reading information out of the database, we have another potential problem: the larger the database gets, the longer it’s going to take to perform queries because we haven’t yet indexed the data.
As a general rule of thumb, anything that’s going to be used in a WHERE clause needs an index. This helps the database find where specific information is faster. The tradeoff with indexes, however, is that they eat up disk space and they have to be maintained. The larger your table, the larger your index in most cases (unless you have a ton of duplicate data, perhaps, depending on the type of index used).
To get this application to perform SELECTs well, we need to index it. The hstore datatype can be indexed with one of two types: GiST or GIN.
Let’s briefly review some differences between GiST and GIN indexes. This is an oversimplification, and for the straight dope I recommend you read the PostgreSQL documentation on GiST vs. GIN indexes, but for the purposes of this tutorial:
- GiST indexes are going to be smaller on disk and faster to build
- GiST indexes may return results slower than GIN indexes
- GiST searches may return false positives, which PostgreSQL has to weed out prior to returning (thus slowing the query’s execution time)
- GIN indexes take about 3 times more time to build than GiST
- GIN indexes will return results faster in most cases, but depends on the logarithmic difference in unique words
Since GIN indexes are faster with more unique words, and we’re storing semi-structured data from the application with lots of potential for duplicate keys and values, it makes sense to use a GiST index. With as much potential for data duplication as exists in this use case, using a GIN index would likely not be all that much faster since it depends on logarithmic difference in unique words to find things quickly, and it would still eat up more space than a GiST index, meaning we’d get the same performance, roughly, but sacrifice extra disk space for no significant gain.
There are two ways we can add this index:
Direct query:
ActiveRecord::Base.connection.execute(“CREATE INDEX user_pref_hstore_gist ON users USING GIST (preferences)”)
Using the activerecord-postgres-hstore gem:
def AddIndexOnPrefs < ActiveRecord::Migration
def change
add_hstore_index :users, :preferences, :type => :gist # or :type => :gin
# Note that :gist is the default. It’s explicitly stated here for
# educational purposes.
end
end
Bringing it all together
This application, being just a demo, isn’t going to do any actual authentication or authorization. The controller, therefore, will look like a pretty standard REST-like controller. The views will be pretty straightforward, and the form for the user object will also be classic Rails.
Controller source code:
class UsersController < ApplicationController
def index
@users = User.all # for demo only, DO NOT do this in production apps
end
def new
@user = User.new
end
def create
@user = User.new(user_params)
if @user.valid? and @user.save
redirect_to @user, :flash => { :notice => t(:user_created) }
else
render :action => :new, :flash => { :error => t(:user_not_created) }
end
end
def show
@user = User.find(params[:id])
end
def edit
@user = User.find(params[:id])
end
def update
@user = User.find(params[:id])
if @user.update_attributes(user_params)
redirect_to @user, :flash => { :notice => t(:user_updated) }
else
render :action => :new, :flash => { :error => t(:user_not_updated) }
end
end
def destroy
@user = User.find(params[:id])
if @user.destroy
redirect_to root_path, :flash => { :notice => t(:user_destroyed) }
else
redirect_to @user, :flash => { :error => t(:user_not_destroyed) }
end
end
private
def user_params
params.require(:user).permit(:email, preferences: [:theme, :language, :currency])
end
end
The only major thing of note here isn’t really specific to PostgreSQL since that’s handled in the model, but pay attention to the use of strong_parameters. In the user_params method, we’re specifying that params[:user] is required, and in that we can permit params[:user][:email], params[:user][:preferences][:theme | :language | :currency] basically. |
Moving on to the views, again it’s classic Rails. Just loop through @user.all for the index (note: bad practice for production for obvious reasons, but again we’ve just got a simple demo here) and render _form.html.erb for edit/new. The form itself might give some people a little trouble, so here’s the source code for that with a brief explanation:
app/views/user/_form.html.erb
<%= form_for @user do |f| %>
<p>
<%= f.label :email %><br>
<%= f.text_field :email %>
</p>
<%= f.fields_for :preferences do |p| %>
<p>
<%= p.label :theme %><br>
<%= p.select :theme, options_for_select(User.valid_themes, @user.preferences["theme"]), { include_blank: true } %>
</p>
<p>
<%= p.label :currency %><br>
<%= p.select :currency, options_for_select(User.valid_currencies, @user.preferences["currency"]), { include_blank: true } %>
</p>
<p>
<%= p.label :language %><br>
<%= p.select :language, options_for_select(User.valid_languages, @user.preferences["language"]), { include_blank: true } %>
</p>
<% end %>
<%= f.submit :class => "btn btn-success" %>
<% end %>
Note that inside the form_for block we’re additionally calling fields_for on the form object, passing it a different symbol representing the attribute we’re serializing in the model, and then creating select lists based on known attributes we’re going to serialize – theme, language and currency in this case. The valid options for those are defined on the class itself as very simple arrays for each method (valid_currencies, valid_languages, valid_themes).
As for the application, those are the most salient points – everything else is pretty standard and/or boring. You can see the full source code at GitHub.
Deploying on Engine Yard Cloud
So you’ve got an application using hstore ready to go, but how do you deploy it? On Engine Yard Cloud this is fairly easy. There’s just one “trick” that you have to do to get things working: enable the PostgreSQL hstore extension via custom chef.
This article isn’t meant to be a primer to custom chef, so we’ll just hit the main points here. First, go clone this github repository on your system somewhere:
https://github.com/engineyard/ey-cloud-recipes
Next, take a look at https://github.com/engineyard/ey-cloud-recipes/blob/master/cookbooks/postgresql9_extensions/README.md in that repository. This readme details the multitude of PostgreSQL extensions available on Engine Yard Cloud, of which hstore is one. Following the directions under “hstore” would have you edit the main chef recipe (cookbooks/main/recipes/default.rb) to enable the recipe by removing the comment line in front of it, as well as the comments in front of the block beginning and end statements. I also enabled the btree_gin and btree_gist recipes as well.
WIth that done, you can upload and run your custom chef recipe by following these directions. Note that this assumes you have an environment running already. This should enable PostgreSQL hstore extensions on your cluster.
Once you’ve uploaded and run your custom chef recipes successfully, you can deploy the application with migrations and see it in use. Remember that if you’re deploying the application I put on GitHub that you should probably choose Rubinius 2.0 in 1.9.3 mode with Puma as your stack. Otherwise, fork it, put “unicorn” in there and try running it on MRI 1.9.3 (or you can try under Passenger as well without changing the Gemfile).
To seed the database once it’s deployed, SSH up to the instance, cd into /data/<appname>/current and issue RAILS_ENV=production bundle exec rake db:seed. If you’re using Early Access Rubinius like I did, that’ll be RAILS_ENV=production ruby -S bundle exec rake db:seed.
PostgreSQL’s hstore extension can provide a lot of extra capability for unstructured or semi-structured data for your application without forcing you to go with a full-blown NoSQL implementation. I wouldn’t recommend using it as a complete replacement for a real NoSQL store, but if you have models or portions thereof that you feel would be better stored and retrieved as unstructured key/value pairs, hstore is probably a good fit for your application.
Share your thoughts with @engineyard on Twitter