Deploying via Git
Git has arguably swept through web development like a sharknado, becoming the most popular VCS in every tech community I know of: PHP, Ruby, JavaScript (and Node.js), Python, Objective-C… like I said, pretty much every tech community.
Using git for deployment is therefore an obvious choice — and being able to deploy by fetching just the differences means it can be fast and easy.
When deploying in the cloud, one of the things you need to keep in mind is the ephemeral nature of the “machines” you are using. Whether it’s degraded instances, or the need to scale up either vertically or horizontally, chances are you’ll need to clone your repository more frequently than you might think.
The speed at which you can deploy is especially critical when you are trying to scale to handle unexpected spikes in traffic.
By default, cloning a git repository clones its entire history. The size therefore depends greatly on the length of that history and the number of changes made over the lifetime of the repository. Pushing around that much data can take some time.
Note: The following benchmarks are very rough and will obviously differ based on a number of factors. They were, however, all run from GitHub to the same Amazon EC2 instance.
Project | Clone Time | Size |
---|---|---|
Zend Framework 2 | 25s | 118MB |
Symfony | 15s | 56MB |
Drupal | 27s | 138MB |
CakePHP | 13s | 48MB |
Zend Framework 1 | 15s | 202MB |
Ruby on Rails | 25s | 129MB |
These were obtained by running:
$ git clone https://github.com/project/repo.git
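If you want to repeat this kind of rough measurement yourself, something like the following could work. It is only a sketch: measure_clone is a hypothetical helper (not part of git), it times with one-second resolution, and the size it reports includes both the working tree and the .git directory.

```shell
# measure_clone URL [extra git-clone options...]
# Hypothetical helper: clones URL into a throwaway directory and prints
# "<seconds>s <kilobytes>KB". Not part of git itself.
measure_clone() {
    url=$1
    shift
    dest=$(mktemp -d) || return 1
    start=$(date +%s)
    # "$@" carries any extra clone options passed after the URL
    git clone --quiet "$@" "$url" "$dest/repo" || { rm -rf "$dest"; return 1; }
    end=$(date +%s)
    # du -sk reports the total size in kilobytes (working tree plus .git)
    size=$(du -sk "$dest/repo" | cut -f1)
    rm -rf "$dest"
    echo "$((end - start))s ${size}KB"
}
```

You would call it as measure_clone https://github.com/project/repo.git, using the same placeholder URL as above.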
To help minimize this, git provides a --depth flag for git clone, which creates what is known as a shallow clone. The manual has this to say about --depth (emphasis mine):
--depth <depth>
Create a shallow clone with a history truncated to the specified number of revisions. A shallow repository has a __number of limitations__ (you cannot clone or fetch from it, nor push from nor into it), but is adequate if you are only interested in the recent history of a large project with a long history, and would want to send in fixes as patches.
What we can do is combine this with the --branch flag to check out a specific branch with minimal history, minimizing the size. This does limit us, however: we can no longer check out tags or arbitrary revisions.
Note: Leaving out --branch will simply check out the tip of the default branch (typically master).
Doing this we see the following results:
Project | Clone Time | Difference | Size | Difference |
---|---|---|---|---|
Zend Framework 2 | 18s | -7s | 76MB | -42MB |
Symfony | 5s | -10s | 31MB | -25MB |
Drupal | 14s | -13s | 91MB | -47MB |
CakePHP | 6s | -7s | 21MB | -27MB |
Zend Framework 1 | 10s | -5s | 168MB | -34MB |
Ruby on Rails | 12s | -13s | 56MB | -73MB |
These were obtained by running:
$ git clone --depth 1 --branch <branch> https://github.com/project/repo.git
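To convince yourself that the history really was truncated, you can count the commits in the resulting clone. The wrapper below is a sketch; shallow_clone is a hypothetical name, and the only git commands it relies on are git clone and git rev-list --count.

```shell
# shallow_clone URL BRANCH DEST
# Hypothetical wrapper around the shallow clone command; prints the number
# of commits that are actually present in the new clone.
shallow_clone() {
    url=$1 branch=$2 dest=$3
    git clone --quiet --depth 1 --branch "$branch" "$url" "$dest" || return 1
    # With --depth 1 only the tip commit is fetched, so this prints 1
    git -C "$dest" rev-list --count HEAD
}
```

A shallow clone also contains a .git/shallow file recording where the history was cut off, which is another quick way to tell the two kinds of clone apart.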
These are (mostly) fairly significant differences. So we’re done, right? Deploys are faster, and we’re all good. Yeah… no.
Shallow clones also bring their own pitfalls.
If you want to switch branches, you must create a new shallow clone of the other branch. This would usually be handled by using symlinks to move between repositories, something like this:
/var/www
/d2340654-master
/a61f576e-feature-foo
/current -> /var/www/d2340654-master
Where /var/www/current is a symlink to the currently deployed code. When you change branches, re-clone and then run:
$ cd /var/www
$ rm -Rf ./current && ln -s ./a61f576e-feature-foo ./current
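The same switch can be wrapped in a small function that refuses to point the symlink at a directory that does not exist. This is a sketch under a few assumptions: switch_release is a hypothetical name, the releases live side by side under one root as in the layout above, and your ln supports the -n option (both GNU and BSD ln do). It replaces the link in a single command, but it does not make the switch strictly atomic.

```shell
# switch_release ROOT RELEASE
# Hypothetical helper: points ROOT/current at ROOT/RELEASE.
switch_release() {
    root=$1 release=$2
    # Refuse to switch to a release directory that does not exist
    [ -d "$root/$release" ] || { echo "no such release: $release" >&2; return 1; }
    # -s: symbolic link, -f: replace an existing link,
    # -n: treat an existing "current" link as a file instead of following it
    ln -sfn "$release" "$root/current"
}
```

For the layout above, the call would be switch_release /var/www a61f576e-feature-foo.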
If you simply want to update the current branch, you would do:
$ git fetch origin
$ git reset --hard origin/<branch>
However, if someone has rebased, this can break the shallow clone, because the only commit it knows about is the currently checked-out one. Rebasing causes that commit to go missing on the remote server, so the clone and the remote no longer have any commits in common from which to create a diff.
This causes an error that looks like:
Git Error: fatal did not find object for shallow
When this happens, you must again create a new shallow clone. Depending on your workflow, this may happen more or less frequently for you.
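One way to automate that fallback is to attempt the quick update first and only re-clone when it fails. The following is a rough sketch, not a hardened deploy script: update_or_reclone is a hypothetical helper, it treats any fetch or reset failure as a reason to re-clone, and it assumes the target directory contains nothing but the clone.

```shell
# update_or_reclone URL BRANCH DIR
# Hypothetical helper: tries the quick update and falls back to a fresh
# shallow clone if the fetch or the reset fails for any reason.
update_or_reclone() {
    url=$1 branch=$2 dir=$3
    if git -C "$dir" fetch --quiet origin &&
       git -C "$dir" reset --quiet --hard "origin/$branch"; then
        echo "updated"
    else
        # Assumption: DIR holds nothing but the clone, so deleting it is safe
        rm -rf "$dir"
        git clone --quiet --depth 1 --branch "$branch" "$url" "$dir" || return 1
        echo "recloned"
    fi
}
```

With the symlink layout described earlier, you would more likely clone into a new directory and switch the symlink than delete the live directory in place.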
If you rebase frequently, I would recommend not using shallow clones: the quick updates will fail more often, causing you to re-clone (albeit shallowly) more often than if you had done a full clone initially. Alternatively, you could experiment with the --depth value to find a happy medium between the number of times rebases break the clone and the size of the cloned repo.
Shallow clones aren’t perfect, but they can provide a measurable and noticeable improvement to deployment times during critical scaling situations.