The Importance of Testing Database Backups

Catching some attention in technology headlines this week, GitLab suffered a major loss of database data and had a long and difficult recovery due to a combination of failing and untested backup and recovery strategies. We commend their openness in responding to this incident and several members of our team benefitted from joining the public livestream of their recovery efforts.

Observing this unfold in realtime encouraged us to take a step back and think about our backup and recovery strategies, and whether we are really doing enough to encourage our customers to test and understand this critical part of their application services. This also was a driving force behind the development of a brand new tool, eyrestore, which is designed to make the restoration of logical backups easier between different Engine Yard environments.

Backup Defaults

By default an Engine Yard database receives both logical (mysqldump/pg_dump) and physical (AWS snapshot) backups once daily, and 10 copies of each backup type are maintained. Once an Environment has been shutdown, these backups will remain available for 90 days; if the environment is deleted, these backups will be removed within 24 hours. The schedule and retention of these backups can be managed from the cloud dashboard, and we also have custom cookbooks that can be helpful in managing more complex requirements.

A snapshot will be used when you are building a new replica database, or when you are terminating and rebuilding all instances in an environment. They can’t be used when downloading data locally, transferring data to a new environment (this may change), changing between filesystem types (ext4, encrypted-EBS), or when upgrading to a newer major version of the database. Snapshots also cannot be used to restore part of a database, or to change the name of object(s) that are being restored. The saving grace for snapshots is they are FAST to work with.

A logical backup is the primary type of database backup worked with during a data recovery or data transfer. While these are slower to create and restore, they are represented as a physical file, and offer the greatest amount of flexibility when restoring data. Logical backups can be downloaded, copied, manipulated, and easily transferred between environments and used across different versions of a given database.

I have a replica, Why should I care?

A replica can be helpful for quick recovery, but can easily have its data damaged by the same process that damaged the master. Suppose a command like delete from users; is run against your database. Chances are good that this statement is going to be processed on your replica before you even realize you left off the WHERE clause.

In fact, just a couple weeks ago we helped a customer where a rake db:setup was run against against Production when rake db:migrate returned structure errors; this actually resets the database using the structure.rb, so all the database data was lost. The replica dutifully replicated these statements so the only option was to resort to restoring the backup. Fortunately, the most recent backup was only about an hour old, and only took about 10 minutes to restore with eybackup.

In short, a replica gives you access to a single point in time of your database, and that point is, typically, always moving forward in time. It’s also usually within a few milliseconds or seconds of the master’s data state which limits its value for data recovery. It generally can’t give you the database state 1 hour ago, 1 day ago, or 1 week ago, which can be critically important if you don’t notice a data loss right away.

Don’t You Test Backups?

Well, Yes and No. We don’t have a targeted process of performing restore tests for each environment for each customer. Such an undertaking would require us to create and destroy resources for each customer on a regular basis increasing the operating cost of each customer. It would also require that we regularly access your data, which is actually pretty inappropriate; and for those of you that use GPG encrypted backups, downright impossible.

That said, Yes. Our backup tools have both unit testing and integration testing built in, and our Data team puts those tools through extensive, hands-on testing before publishing any new updates. In addition to this, every replica created relies on a Snapshot backup, and in any given week, someone, somewhere is doing something with a logical backup created by our tools. So backups and restores do absolutely get tested, just not necessarily your backups.

Testing Strategies

In general a backup restore test should be done weekly or monthly, or anytime you have any significant changes in data size or database structure. This will of course vary by application and business model, so find the schedule that works right for you and formalize it with a scheduled event. When you are doing tests, you want to pay attention to:

  1. Data Quality: is the data correct and complete.
  2. Recovery Time Objective: does the retrieval and restoration of the backup complete in a reasonable timeframe for your business requirements.
  3. Recovery Point Objective: what is the maximum amount of data you could lose if a loss occurs right before the next backup starts, and does this meet your business requirements.
  4. Backup History: how many days of backups do you currently have and does this meet your business requirements.

For validating snapshots, a good approach would be to clone your Production environment. This process uses the most recent snapshots on all your primary hosts (app_master, db_master, and each utility) and builds a new environment using those snapshots as the source of data. You can optionally use this environment as one of your test environments, or just do some checks to make sure your application contains the data you would expect and responds as expected. For validating just a database, a good option can be to create a replica.

For logical backups, a good option would be to restore a backup from Production either in a Testing/Staging environment, or locally. Depending on your specific business requirements, restoring these backups can actually be a valuable and important part of your development lifecycle. Recently we wrote about some faster ways to move database data between environments which also discussed the traditional method of testing backups. While these approaches still have their uses, we decided to create a new tool dedicated to simplifying the process of working with backups between environments.

Introducing eyrestore

For a while now we’ve had customers copying their production eybackup configuration files to Staging (under a unique name) and then referencing those to do restores from Production. While this is functional, it is not something we could recommend or publish since if you supply the wrong parameters you could end up creating a backup of Staging and storing it with the Production backups; you wouldn’t be able to tell the backups apart, and this also messes up retention counts. We needed a solution that was a bit smarter.

The eyrestore tool is a wrapper that relies on existing eybackup functionality. Essentially, what it does is accept specific parameters such as the source environment, source database, and action to take (list, download, restore) and generates a temporary configuration file that it then uses with eybackup. Since eyrestore has its own CLI that limits what it can do, and it removes the temporary configuration file that references Production automatically, it eliminates the concern around incorrectly storing a Staging backup.

But this wasn’t going far enough. When you restore a backup currently you have to supply an index reference made up of the database name and the order of which the backup is taken. The oldest backup has an index of 0:<dbname> while the newest has the highest index count (e.g. 9:<dbname> with 10 backups). So this makes it really difficult to reference the most recent backup for a given environment. So, we designed eyrestore to recognize an index of last as referring to the most recent backup.

Finally, we gave it its very own configuration file, in addition to the temporary configuration, allowing you to setup multiple default profiles and further customize it to fit your needs.

Usage is pretty straightforward, for example, from an environment named staging with a database named todo you could restore the production database as follows:

sudo -i eyrestore --env production --database todo --action restore --index last

and the output might look something like:

You are restoring the backup for todo from 'production' into 'staging', THIS MAY BE DESTRUCTIVE; are you sure you want to proceed (Y/n) ? [timeout=30secs]:y
Running 'restore' on index '9:todo'.
2017-02-02 20:53:52 +0000 Restoring todo
2017-02-02 20:53:52 +0000 Downloading production.todo/todo.2013-01-09T01-20-03.dump to /mnt/tmp
Filename: /mnt/tmp/todo.2013-01-09T01-20-03.dump
Restore complete!

Additional details about working with database backups and restores can be found in the Engine Yard Cloud docs.

Availability

The eyrestore tool has been added to the ey_cloud_server gem that hosts our eybackup tool; and will be rolled out for Stable-V4 and Stable-V5 stacks in the next couple weeks. Don’t fret though, in the hopes of encouraging more of our customers to test their backups we designed it specifically so it would be easy to setup manually. If you’d like to take it for an early spin please open a ticket with our Support team and we’ll help you sort out the details.

Our Data team is hard at work on a long list of new features for 2017, but if you’ve got some suggestions we’d love to hear from you.