Disaster Recovery on Engine Yard Cloud, the Cold Standby Edition

We get a lot of requests for disaster recovery, but what exactly is D/R? It can mean different things depending on your business needs. Most people think of D/R as a hot standby, but a cold standby may be enough if it meets your business requirements. The process starts with a few questions: How much downtime can we tolerate? How much data can we stand to lose? How much money are we willing to spend? If the answers to the first two questions are "minutes" and "none", then you’ll need a hot standby, which we’ll cover in a follow-up blog post. If the answers are "several hours" and "some data", then a cold standby may be the most cost-effective option for your business. The following is an introduction to setting up a cold standby on Engine Yard Cloud. For the purposes of this post, we will focus on a simple application with a relational database.

We start by using the Copy Environment feature to create an exact replica of the environment in another region. Make sure that both environments run exactly the same software versions.

Once the environment is up and running, you should import test data and ensure that the application is working as expected. Next, we need to consider how we will restore data should the cold standby ever go live. For this, we can use the eybackup tool, which will require a custom configuration file on the cold standby. The eybackup tool was designed to back up and restore database dumps within a single environment, but with a little tweaking we can use it to download database dumps from other environments. We’ve compiled a Chef cookbook that drops this configuration file, as well as a wrapper script, into place for you. To use this cookbook, you must update a few values in the attributes file: the application name, the database type, the source environment name and the name of the bucket that stores the backups for the live environment. Grabbing the bucket name is as simple as connecting via SSH to your live database master or slave and viewing the eybackup configuration file. Depending on the database you use, it will be either /etc/.mysql.backups.yml or /etc/.postgresql.backups.yml:

~ # cat /etc/.postgresql.backups.yml | grep bucket
:backup_bucket: ey-backup-5c8dbfba94b9
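
If you look this value up often, the lookup can be wrapped in a couple of small shell functions. The following is only a sketch: the function names backup_config_path and bucket_from_config are hypothetical, and it assumes the configuration file contains a single :backup_bucket: line in the format shown above.

```shell
# Print the eybackup config path for a database type ("mysql" or "postgresql").
# The two paths are the ones named in this post; the function name is hypothetical.
backup_config_path() {
  case "$1" in
    mysql)      echo "/etc/.mysql.backups.yml" ;;
    postgresql) echo "/etc/.postgresql.backups.yml" ;;
    *)          echo "unknown database type: $1" >&2; return 1 ;;
  esac
}

# Print the value of the :backup_bucket: key from the given config file.
# Assumes the "key: value" layout shown in the example output above.
bucket_from_config() {
  sed -n 's/^:backup_bucket:[[:space:]]*//p' "$1"
}

# Usage on the live database instance:
#   bucket_from_config "$(backup_config_path postgresql)"
```

Printing only the value, rather than the whole matching line, makes it easier to paste the result into the cookbook's attributes file.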

Next, we will want to consider backup frequency. The question you need to ask is: how much data can we tolerate losing? With a cold standby, some amount of data will be lost. If up to an hour’s worth of data loss is tolerable, you would set the database backup frequency to every hour. You will also want to evaluate all of your other data stores. Can you stand to lose all background jobs in the queue? If not, then the data store for those jobs will need to be backed up using a custom script, and the restore process will have to be worked into your recovery plan.
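
One way to keep yourself honest about the data-loss window is to check that the newest dump you have is no older than the window you agreed on. The sketch below is an illustration only: backup_is_fresh is a hypothetical helper, not part of eybackup, and it assumes you have a dump file on disk whose modification time reflects when the backup was taken.

```shell
# Succeed if FILE exists and was modified within the last MAX_MINUTES minutes.
# backup_is_fresh is a hypothetical helper, not part of eybackup.
# Assumption: the file's modification time reflects when the backup was taken.
backup_is_fresh() {
  file=$1
  max_minutes=$2
  [ -f "$file" ] || return 1
  # find prints the path only when the file is newer than the cutoff.
  [ -n "$(find "$file" -mmin "-$max_minutes")" ]
}

# Usage, for a one-hour data-loss tolerance (file_name is a placeholder):
#   backup_is_fresh file_name 60 || echo "latest backup is older than one hour"
```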

With all business needs considered, it’s time to draft an actual plan to restore services. Note that the command in the third step will change if you are using PostgreSQL. The download_backup.sh script also prints this command so that you can copy and paste it into your terminal to kick off the restore.

  1. Boot up cold standby instances using snapshots to speed the process up
  2. When the database instance is booted and the Chef runs have completed, download the latest backup to the instance:
    a. sudo /home/deploy/download_backup.sh
  3. Import the database dump:
    a. sudo gunzip < file_name | mysql app_name
  4. Test application
  5. Switch DNS
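
Between steps 2 and 3, it can be worth confirming that the downloaded dump is intact before piping it into the database. Since the import command above decompresses the file with gunzip, a minimal check is to test the gzip stream first. This is a sketch, and dump_is_valid is a hypothetical helper name:

```shell
# Succeed only if the file exists, is non-empty, and is a valid gzip stream.
# dump_is_valid is a hypothetical helper, not part of eybackup.
dump_is_valid() {
  # Reading from stdin avoids any dependence on the file's suffix.
  [ -s "$1" ] && gzip -t < "$1" 2>/dev/null
}

# Usage before the import in step 3 (file_name is a placeholder):
#   dump_is_valid file_name || echo "dump is missing or corrupt; download it again"
```

A truncated download fails this test, which is cheaper to discover here than halfway through an import.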

Once the plan has been drafted, we highly recommend at least one dry run. A dry run lets you practice and familiarize yourself with the process, and timing it gives you an idea of how much downtime you will be facing. With all the dry runs completed to your satisfaction, you can shut down the cold standby instances so that you aren’t billed for compute hours. Moving forward, you will want to apply all stack upgrades to both the live and cold standby environments. Software parity is very important, and you do not want any mismatches between your environments.
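
To time a dry run, you could wrap each step of the plan in a small helper that reports how long it took. This is a sketch under our own assumptions: timed_step is a hypothetical name, and it relies on date +%s, which is available on common Linux systems.

```shell
# Run a command, then print the step label, elapsed whole seconds and exit status.
# timed_step is a hypothetical helper name.
timed_step() {
  label=$1
  shift
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "$label: $((end - start))s (exit $status)"
  return "$status"
}

# Usage during a dry run:
#   timed_step "download backup" sudo /home/deploy/download_backup.sh
```

Adding up the per-step times gives a rough estimate of the downtime to expect, and it shows which step is worth speeding up first.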

In summary, a cold standby can be a very cost-effective way to mitigate disasters, but you have to be willing to make some concessions in terms of data loss and downtime. In a follow-up blog post, we will go over how you can create a hot standby for your production environment to prevent data loss.