Ruby Scales, AND It’s Fast - If You Do It Right!

Performance: it’s a topic that comes up over and over again in the Ruby world, and everyone’s got an opinion. Unfortunately, those opinions often focus on minutiae, and tend to miss the big picture.

On top of that, discussing Ruby performance is far more complex today, because one really has to talk about it in the context of a specific implementation. Are we talking about Matz’ Ruby 1.8.x, or 1.9.x? Are we talking about Rubinius or JRuby? What about MacRuby? IronRuby? MagLev? Every one of these has a different performance profile and level of completeness.

For the purposes of this post, and for the purposes of the attention I paid to the two quotes above, I’m going to focus on Matz’ Ruby 1.8.x (MRI). It’s been the Ruby for many years, and it’s what most people are pointing at when they complain about Ruby being slow. Don’t just take my word for it though—check out The Computer Language Benchmarks Game for a substantial set of flawed micro-benchmarks using a plethora of different languages. What they call “Ruby MRI” is, at this time, ruby 1.8.7 (2009-06-12 patchlevel 174). It’s not even close to being the most recent version of 1.8.7, but that’s OK. The benchmarks there have to be taken with a couple of grains of salt, anyway.

Here’s why: micro-benchmarks for languages have only a weak relationship to the performance of complex systems implemented in those languages, even when those systems are implemented well. Or, to put it another way, the speed at which a language can complete a simple, discrete task is not necessarily a strong predictor of how fast a complicated application, composed of many tasks, will perform when implemented in that language. Other factors come into play that can strongly influence overall performance: factors like application architecture, and the ability to leverage higher-level built-in capabilities that simplify things which would be complex to implement in other languages.

Many of you probably know people who claim Ruby can’t scale, or is too slow for business-critical web applications. Since you’re reading this, you also know those people are wrong. In fact, it’s usually far easier to scale a Rails application’s web-facing aspect than it is to scale the data storage parts of the application. Nonetheless, scaling that web-facing aspect has costs, and if your application can return content to your customers more efficiently, reducing your hardware needs, you reduce your costs.

Returning to the ruby-talk thread that those quotes came from, my response included an assertion that I could spin up a single Engine Yard Cloud instance and, running an all-Ruby stack on it, push 200,000,000 requests through it in less than a day. When I say an all-Ruby stack, I’m not talking about the database layer, but rather the application and anything above it (such as the web server). I wouldn’t use Apache, nginx, or any other non-Ruby web server, and I’d use a real, complex application.

Since I already had a 64-bit, 4 ECU instance running that I use for testing Ruby 1.8.6 changes, I just used that existing instance. I used Ruby 1.8.6 patchlevel 287 for this. I could’ve used any version, as RVM makes it simple to pick and choose, but I selected that one because many sites have run on it for a long time (though if you are running on it now, you really should upgrade), and by being a less-than-current version, it serves my point well.

For generating test traffic, I used the venerable Apache Bench. Even after all these years it’s still got some buggy corner cases, but it’s straightforward and easy to use, and its own performance is high enough that it takes some pretty fast test subjects before you start running into the performance limitations of the tool instead of the test subjects. I ran it on the same machine as my application’s stack because I wanted to eliminate the network as a factor in the results, and just feed as many requests to my stack as quickly as possible.

The test application was Redmine, version 0.8.7. I selected Redmine because it’s a complex application familiar to many people, and it’s easy to install. It’s also not yet optimized for speed. Development has been far more focused on features and function than on optimizing for resource usage efficiency. The Rails version that I used is 2.3.2.

So, after installing and configuring Redmine, I started it:

ruby script/server -e production -d

Note that I did not use Mongrel, evented_mongrel, Thin, or anything else sophisticated as the container for the application. It was just WEBrick, and it was just a single instance of WEBrick.

I then threw some random data into it just so that there was something other than the empty pages. So, let’s see how it performed!

ab -n 10000 -c 1 http://127.0.0.1:3000/

Hmmm. I rode my exercise bike 1.3 miles while that ran… That didn’t feel fast at all.

Requests per second:    33.98 [#/sec] (mean)
Time per request:       29.432 [ms] (mean)
Time per request:       29.432 [ms] (mean, across all concurrent requests)

OK. I mean, that’s not horrible. Redmine isn’t a lightweight app, and that’s over 2.9 million requests a day on a single process. What happens if there’s some concurrency?

ab -n 10000 -c 25 http://127.0.0.1:3000/
Requests per second:    31.11 [#/sec] (mean)
Time per request:       803.707 [ms] (mean)
Time per request:       32.148 [ms] (mean, across all concurrent requests)

That was a 1.4 mile benchmark ride. Shoot; does that mean Ruby really is slow? That did not go in the direction we needed, and let’s be real here: in a real application deployment, there are going to be concurrent requests—many of them, if you’re at all successful. It’s pretty clear what direction everything was moving in, but I wanted to take it one step further.

ab -n 10000 -c 500 http://127.0.0.1:3000/
Benchmarking 127.0.0.1 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
apr_socket_recv: Connection reset by peer (104)

Well, good to know. Clearly, Redmine running inside of WEBrick can scale, but there are limits that aren’t too hard to hit on a single process. If we were spreading these requests over multiple processes on multiple instances, we could reasonably scale to many millions of requests per day, even running our code on WEBrick, assuming that the database layer could keep up with all of that. However, that’s still a long way from two hundred million requests per day.

Even if we were running on a Ruby implementation that was 2x as fast, or 5x as fast, and even if the application were running in a faster container, the basic problem is still the same—we’d have to throw hardware at it until the problem went away. Even if you spent a lot of time laboriously building Redmine in C++ while focusing on performance, you still wouldn’t escape the need, with this simple architecture, to throw hardware at the problem. So, what do you do if you need more throughput out of your application, but aren’t excited about adding more hardware resources?

Consider these runs:

ab -n 10000 -c 1 -C '_redmine_session=9ec759408f1ae3c6f919e50baba5a3dc; path=/' http://127.0.0.1/
Requests per second:    2839.37 [#/sec] (mean)
Time per request:       0.352 [ms] (mean)
Time per request:       0.352 [ms] (mean, across all concurrent requests)

ab -n 10000 -c 1000 -C '_redmine_session=9ec759408f1ae3c6f919e50baba5a3dc; path=/' http://127.0.0.1/
Requests per second:    3862.33 [#/sec] (mean)
Time per request:       258.911 [ms] (mean)
Time per request:       0.259 [ms] (mean, across all concurrent requests)

ab -n 100000 -c 25 -k  -C '_redmine_session=9ec759408f1ae3c6f919e50baba5a3dc; path=/' http://127.0.0.1/
Requests per second:    7797.39 [#/sec] (mean)
Time per request:       3.206 [ms] (mean)
Time per request:       0.128 [ms] (mean, across all concurrent requests)

I barely had time to turn the cranks on the exercise bike for those runs! It turns out that to get that performance, I needed to look at my architecture and rethink how I was positioning my application’s web-facing aspect. Most applications, even highly dynamic ones, show lots of the same content to their users. In many cases completely identical content is being displayed for many different users. It’s senseless to regenerate this content over and over again. This is where caching enters the architecture picture.

Rails 2 has some built-in support for caching. It’ll do page caching, which basically writes a static copy of a dynamically generated page to a persistent location, so that on subsequent hits the web server can deliver the page directly. This works great, but it has limitations.

All content, for everyone, for a given URL must be identical, and you’re responsible for providing a sweeper that clears old content. Also, requests still fall through to your web server, which may mean that you encounter significant performance penalties when delivering your content in some situations. For example, nginx delivers static files quite quickly if it’s sitting on top of a fast disk. Sit it on a slow disk, though, and page caching returns limited dividends. If it can work for your application, though, use it.
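As a rough sketch, page caching plus a sweeper in a Rails 2 application looks something like this (the controller, model, and sweeper names here are hypothetical, not Redmine’s):

class WelcomeController < ApplicationController
  # After the first render, Rails writes public/index.html; the web
  # server can then deliver that file directly on subsequent hits.
  caches_page :index
end

class IssuesController < ApplicationController
  # Run the sweeper whenever an action here changes an issue.
  cache_sweeper :issue_sweeper, :only => [:create, :update, :destroy]
end

class IssueSweeper < ActionController::Caching::Sweeper
  observe Issue

  def after_save(issue)
    # Throw away the stale static copy; the next hit regenerates it.
    expire_page(:controller => 'welcome', :action => 'index')
  end
end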

Rails also supports partial caching in several guises—to the file system, to memory, to memcached, etc. Partial caching can be a win architecturally, because it bypasses all of the heavy work involved in generating content; your app can just assemble pregenerated fragments into a complete page. If you haven’t done so, look into that as well. It can be very helpful.
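For example, a sidebar that’s identical for every user can be wrapped in a cache block in the view (the fragment name and partial here are made up):

<% cache 'issues_sidebar' do %>
  <%= render :partial => 'issues/sidebar' %>
<% end %>

The backing store is chosen in config/environment.rb; for instance, config.cache_store = :mem_cache_store sends fragments to memcached.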

Along those same conceptual lines, there are also Edge Side Includes, or ESI. ESI essentially lets one’s application return a skeleton of a page, that is, an incomplete page with some special markup embedded. A proxy that receives that content and understands ESI markup can then insert content, either from its own cache, or from a subrequest that it issues to some other URL.

This lets a proxy cache a generated but incomplete page, yet still fill it out with smaller pieces of dynamically generated content without pushing all of that work back into the dynamic application. So it’s a bit like partial caching, but it’s handled at a shallower level in the stack. I’ve heard that Rails 3 will have a plugin to facilitate the use of ESI, and that it may come built in with a later dot release. Not all reverse caching proxies support ESI, but many of them do.
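To make that concrete, a cached page skeleton might carry markup like this (the URL in src is made up); everything around the tag comes straight from the proxy’s cache, while the tag itself is replaced on each request:

<html>
  <body>
    <h1>My Projects</h1>
    <!-- an ESI-aware proxy replaces this tag with the output of a
         subrequest to /issues/recent, or with its own cached copy -->
    <esi:include src="/issues/recent"/>
  </body>
</html>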

For Redmine, page caching doesn’t work very well. It, like many applications, uses cookies. Applications can use cookies to identify users, to handle authentication, or to persist data in the user’s browser instead of on the server. When an application needs to deliver cookies in addition to content, simple page caching won’t work. Redmine falls into this category. And besides… I promised to use a Ruby stack, so leveraging nginx or Apache to serve files from a page cache would be cheating.

What I really needed was a caching reverse proxy that would sit in front of the application. It had to be smart enough to do the right thing with regard to caching content that has cookies attached (at least for some definition of the right thing), and it had to be stubborn enough to not-quite follow the Cache-Control headers that Redmine set. It needed to be implemented in Ruby, and it had to be fast enough to be worthwhile.

Most caching reverse proxies are implemented in fast languages. Varnish, one of the fastest caching reverse proxies, is written in C. nginx, which can be configured to provide a caching reverse proxy, is also implemented in C, as is Squid, one of the oldest proxy servers. Traffic Server is implemented in C++.

Refer back to the benchmarks site. C is a lot faster than MRI Ruby. C++ is significantly faster, too. So, to borrow a phrase from my grandmother, how on God’s green Earth do I expect to write a proxy in Ruby that can compete with one written in a language that benchmarks 100x-200x faster?

Bullheaded stubbornness in the face of ignorance? Well, yes, a little bit, combined with some specific architectural decisions. Most of those proxies try to do everything. I think there are probably configuration options in Squid that would get it to cook breakfast for me. Traffic Server probably won’t cook breakfast for me, yet, but it will make the bed, and somewhere in the TODO, I’m sure they have plans to allow it to make breakfast, too, if you can figure out how to configure it. Varnish is one of the fastest proxies, and it gets its speed, in large part, because it won’t make the bed or cook my breakfast. It’s like Charles Emerson Winchester III from M*A*S*H: “I do one thing at a time, I do it very well, and then I move on.” Varnish does still take some configuration education to get it to work well, though.

And that is the secret to keeping things fast. Or, at least one of the secrets, anyway. I took it one step further. My approach was:

Do one thing at a time, do it well enough, and then move on.

A couple of years ago I wrote a very fast proxy and simple web server in Ruby that I called Swiftiply. It leverages EventMachine for handling network traffic, and then tries to squeeze the rest of the performance that it needs out of Ruby by not providing any more capability than is really needed to get the job done. Someone once said that “No code is faster than no code.”

Swiftiply didn’t provide enough capability for a caching reverse proxy, but it did have the capability to serve and cache static assets very quickly (on a lot of hardware my benchmarking efforts have run up against Apache Bench’s own performance limits), and it did already function as a proxy, so much of the capability was there. One advantage to it being written in Ruby was that it was relatively straightforward for me to add additional capability to it. So I did.
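To give a flavor of the approach, here’s a minimal sketch of the EventMachine pattern that Swiftiply builds on (illustrative only, not Swiftiply’s actual code). EventMachine hands receive_data raw bytes off the socket, and the handler does as little work as it can get away with before writing a response:

require 'eventmachine'

module TinyHandler
  def receive_data(data)
    # Buffer until the end of the request headers arrives; a real
    # server would parse the request line and headers properly.
    (@buffer ||= '') << data
    return unless @buffer.include?("\r\n\r\n")
    body = 'hi'
    send_data("HTTP/1.1 200 OK\r\nContent-Length: #{body.size}\r\n\r\n#{body}")
    close_connection_after_writing
  end
end

EventMachine.run do
  EventMachine.start_server('127.0.0.1', 8080, TinyHandler)
end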

Handling Redmine properly requires the ability to cache different versions of the same URL, where the only differentiator is the cookies. Also, Redmine sets a Cache-Control header that looks like this:

Cache-Control: private, max-age=0, must-revalidate

Without digging into it deeply, this means that public caches should not cache the content, and private caches need to confirm with the server that they have valid content before using it. But we want to ignore that (unless Cache-Control is set to no-cache, in which case we’ll pay attention), because we do want to keep private content cached, and we do not want to have to go back to the application to revalidate on every request. My assumption is that it is OK if, for example, a new issue is added, but it takes a few seconds before a URL which shows the issues is refreshed to display that new issue.
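Boiled down, the policy I wanted looks something like this (the method names are mine, for illustration, not Swiftiply’s):

# Vary the cached copy on the cookies as well as the URL, so two users
# with different sessions never see each other's pages.
def cache_key(path, cookie_header)
  "#{path}|#{cookie_header}"
end

# Deliberately looser than HTTP 1.1: 'private' and 'must-revalidate'
# are ignored; only an explicit no-cache opts a response out of the cache.
def cacheable?(cache_control)
  cache_control.to_s !~ /no-cache/i
end

Cached entries then expire after a few seconds, so that a new issue shows up in the issue list quickly, even though the proxy never revalidates.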

The end result is a caching reverse proxy with very few tuning knobs, and behavior that’s not quite HTTP 1.1 correct, but that is very fast, stable, and hackable. It’s probably not actually as fast as it could be, since I piggybacked the implementation onto something that’s doing more than I really need, but it’s good enough. Ruby, as a “slow” language, delivers something that runs very fast and is good enough for the goal that I had.

If you’re wondering how many requests were pushed through my Ruby stack in 24 hours:

Requests per second:    3283.09 [#/sec] (mean)

That’s 283,659,084 requests in 24 hours (roughly 3,283 requests per second times 86,400 seconds), and none of them were keepalive requests. All handled in a Ruby stack. All with a completely browseable and usable Redmine installation that was still responsive while the test was running; I added issues, edited them, removed them, and did administrative actions with no perceptible delays.

I readily admit that this isn’t a test that faithfully simulates real production loads; you probably aren’t going to roll out a production web app servicing two or three hundred million requests a day on a single modestly sized EY Cloud instance. But if you were doing something that wasn’t going to be bottlenecked by the data store, you just might be able to do it, all with slow, slow Ruby. Not bad.

It’s no Varnish, and it never will be. Varnish does far more, more correctly, and all a little bit faster. Varnish also requires some careful tuning to run well, and is not nearly so hackable, so there are tradeoffs. If you need more performance out of your application, look closely at what a caching reverse proxy can do for you. In the larger view of your application’s deployment architecture, it can make a tremendous difference in your users’ experience. Varnish is a great piece of software, and deserves a post of its own covering configuration and usage.

And if you truly find that you need some specialized capability, don’t be afraid to spike something out with Ruby. Paying a little attention to writing lean code that delivers just the capabilities that you need can result in surprisingly fast, capable code, even in a slow implementation of a slow language like Ruby ;)

Questions and comments welcome!