Zero Downtime Deploys with Migrations

I gave a talk at RubyConf that covered how we automate our development and deploy process here at Engine Yard. The part of the talk that drew the most questions and discussion was how we deploy with zero downtime, even with migrations. Here is a bit more detail on the process.

A Caveat

At Engine Yard, we almost exclusively use DataMapper for our ORM. This means that we specify all of the properties for our models in the model class; there is no database introspection to generate accessors for the properties. I have not tried this technique with ActiveRecord, but the concepts should be the same.

Example: Adding a column

This is the simplest example, but also the most common migration that we do.

First, before merging to master any part of the feature that needs the new column, create a migration on master that adds the column.

migration 1, :"add published_at to posts" do
  up do
    modify_table :posts do
      add_column :published_at, Time
    end
  end
end

Then we simply test the migration and ship it all the way to production. Now the column exists in the database, ready for us to use when we merge the code that requires it.

This also allows us to always run migrations last in the deploy, after we have restarted the app servers. This means that new code is running on the app servers earlier in the deploy process.
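Shipping the column ahead of the code is safe because of the explicit property declarations mentioned in the caveat: code that does not declare a column never looks at it. The sketch below is a simplified plain-Ruby model of that idea, not DataMapper's actual loading code. `DECLARED_PROPERTIES`, `load_post`, and the `title` column are hypothetical names used only for illustration.

```ruby
# Properties the (hypothetical) currently deployed code declares for Post.
DECLARED_PROPERTIES = %i[id title body].freeze

# Builds a model-like hash from a database row, ignoring undeclared columns.
def load_post(row)
  row.select { |column, _value| DECLARED_PROPERTIES.include?(column) }
end

# The row already carries the new published_at column; the old code never sees it.
row = { id: 1, title: "Hello", body: "First post", published_at: nil }
p load_post(row)
```

Once the code that declares `published_at` is merged and deployed, the column is already there waiting for it.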

Renaming a Column

Adding a column is pretty straightforward but there are some migrations that are much trickier to do with zero downtime.

Let's say you have a Post model, and you want to rename the body field to content.

We have a solution for this as well.

Step 1: Add the new column

Same as above, write and deploy the migration to add the new column to the database.

migration 1, :"add content to posts" do
  up do
    modify_table :posts do
      add_column :content, Text
    end
  end
end

Step 2: Make the code aware of both columns

We usually do this by writing the code as if we were solely using the new column, then having a few accessors and hooks on the model to sync the data over.

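class Post
  # Reads go through the new column, after pulling in whatever
  # older code may have written to body.
  def content
    update_content
    attribute_get(:content)
  end

  # Writes go to both columns so old and new code see the same value.
  def content=(new_content)
    self.body = new_content
    attribute_set(:content, new_content)
  end

  # Copy body into content before every save.
  before :save, :update_content
  def update_content
    self.content = body
  end
end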
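To see how the two columns stay in step while old and new code run side by side, here is a plain-Ruby stand-in for the model above. It uses instance variables instead of DataMapper's attribute_get and attribute_set, and the class name `SyncedPost` is hypothetical, so treat it as a sketch of the behavior rather than the real model.

```ruby
# Plain-Ruby stand-in for the Post model; no ORM or database involved.
class SyncedPost
  attr_accessor :body # old column, still written by old code

  def initialize(body: nil)
    @body = body
    @content = nil
  end

  # Reads pull the latest value from the old column first.
  def content
    update_content
    @content
  end

  # Writes go to both columns so either code version sees the change.
  def content=(new_content)
    self.body = new_content
    @content = new_content
  end

  # In the real model this also runs as a before-save hook.
  def update_content
    self.content = body
  end
end

post = SyncedPost.new(body: "written by old code")
puts post.content # value copied over from body on read

post.content = "written by new code"
puts post.body # old column kept in step for old code

post.body = "old code writes again"
puts post.content # new accessor still sees the latest value
```

Note that body stays the source of truth during this phase: every write to content also lands in body, and every read of content refreshes from body.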

Step 3: Migrate the data

Now that the code is syncing data as it runs, we need to migrate all existing rows to the new schema.

We can either use a migration for this or just write and run a rake task to copy all data from the body column to the content column.

Once this step is complete, the content column should be canonical and remain that way due to the way the code is written.
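One way the copy could be written is as a series of small UPDATE statements rather than one large one. Batching is an assumption on my part, not something the process above requires; it simply keeps each statement short on a big table. The sketch assumes posts has an integer id primary key, and `backfill_statements` and `batch_size` are hypothetical names. It only builds the SQL strings; running them is left to whatever rake task or migration you use.

```ruby
# Builds one UPDATE per id range so each statement touches a bounded number of rows.
# Table and column names come from the example; batch_size is an assumed tuning knob.
def backfill_statements(min_id, max_id, batch_size)
  raise ArgumentError, "batch_size must be positive" unless batch_size.positive?

  (min_id..max_id).step(batch_size).map do |start_id|
    end_id = [start_id + batch_size - 1, max_id].min
    "UPDATE posts SET content = body WHERE id BETWEEN #{start_id} AND #{end_id}"
  end
end

backfill_statements(1, 2500, 1000).each { |sql| puts sql }
```

Because the syncing code from Step 2 keeps body up to date, the statements can be re-run safely if the task is interrupted.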

Step 4: Remove the temporary syncing code

We can now remove all references to the body column, including the code that syncs the data between the two columns.

At this point, we are almost done. The code no longer uses the old column; it remains only in the database.
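Before moving on, it can help to double-check that nothing still mentions the old column. The following is a rough sketch of such a check, not part of our actual tooling. `column_references` is a hypothetical helper, and `sources` is a hypothetical map of file names to file contents. A plain word match like this will produce false positives (an HTML body tag, for example), so treat the result as a list of places to review by hand.

```ruby
# Returns "file:line" locations where `column` still appears as a whole word.
# `sources` maps a file name to its contents; both are hypothetical inputs.
def column_references(column, sources)
  pattern = /\b#{Regexp.escape(column)}\b/
  sources.flat_map do |file, text|
    text.each_line.with_index(1)
        .select { |line, _number| line.match?(pattern) }
        .map { |_line, number| "#{file}:#{number}" }
  end
end

sources = {
  "post.rb"      => "class Post\n  def summary\n    content[0, 100]\n  end\nend\n",
  "old_view.erb" => "<p><%= post.body %></p>\n"
}
p column_references("body", sources)
```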

Step 5: Drop the column

Finally, all we need to do is write a migration to drop the column from the database and deploy it to production. The code does not reference the column, so we can deploy it at our leisure.

YMMV

This technique has worked very well for us at Engine Yard. It can be more complex than running a single migration, but overall it leads to less friction in the deploy process. When you know that there is no downtime needed for any step of your work, you can simply write code and ship it to production when you’re ready, rather than being blocked waiting for a maintenance window.

Adding a column to the database is by far the most common kind of migration that we do, and this technique keeps it very simple. Overall, it has not added a significant amount of overhead to getting things done, and the lack of friction is a huge win.

Doesn’t always work

All this said, there are a small number of cases where you really cannot use this technique. These are cases where the migration will take a long time and lock tables or cause heavy load on the database server. In the past year, these kinds of migrations have been the only ones that we have had to take downtime for. The most common case was adding an index on a very large table. But if your database is heavily loaded, many ALTER TABLE statements may take an unacceptably long time.