Polyglot Background Jobs

There are many things we end up handing off to background jobs, but the main reason is to provide a snappy, non-blocking user experience.

Whether the task is encoding a video file, importing a batch of data, or (in one case I ran into) Jabber instant messaging, we want to offload it from our web servers as quickly as possible.

There are lots of tools to accomplish this across all languages, including Resque, Sidekiq, delayed_job, node-schedule, beanstalkd, Amazon Simple Queue Service (SQS) and then there is my personal favorite: Gearman.

Gearman has client libraries in C, PHP, Ruby, Node.js, Python, Java, Perl, C#/.NET and even includes tools that can be called via shell script, and user-defined functions for both MySQL and PostgreSQL.

Gearman itself is written in C, and is super simple. If you get a chance, I highly recommend checking out the source code. Note: Gearman was originally written in Perl and later rewritten in C. Be sure not to use the Perl version (e.g. dev-perl/Gearman* in Gentoo Portage).

The main reason I like gearmand is its simplicity. Gearman has three parts to it:

  1. GearmanClient submits tasks to the job queue
  2. gearmand is the job queue itself (running as a daemon)
  3. GearmanWorker retrieves the tasks from the job queue and handles them

[Figure: Gearman communication diagram]

By default, the Gearman queue is stored in memory; however, you can also make it persistent by storing it in MySQL, PostgreSQL, memcached or SQLite. With memcached, if it’s running on the same machine as gearmand then you’re likely to lose it just as easily as the regular in-memory queue when that machine goes down. The only difference is that you could restart gearmand without losing the queue.

However, another potential option is to use the new MySQL 5.6 NoSQL Interface, which supports the memcached protocol. This should be faster than using the Gearman MySQL backend without sacrificing the persistence it brings.

Gearman obviously has the ability to run background jobs, since that is what this post is all about, but it also supports foreground jobs, which allow the GearmanClient and the GearmanWorker to communicate with each other using gearmand as the middleman.

The best thing about Gearman is that you can use different languages for different pieces. So you build your website in PHP, but maybe it’s not the best option for wrangling text; so you schedule a job with gearmand, and a Python worker picks it up. Or Ruby, or Node.js, or… you get the idea.

What this allows us to do is to pick the correct tool for every task in our stack. Why work around the pitfalls of our primary language when we can simply pick up a better tool and do things right?

Using Gearman

First we are going to use PHP to schedule a task with the job queue. This uses the pecl/gearman extension.

function createBackgroundJob($task, $data = array()) {
    $client = new \GearmanClient();
    $client->addServer(/* Defaults to 127.0.0.1, 4730 */);

    // Queue the job and return immediately; the payload travels as JSON
    $handle = $client->doBackground($task, json_encode($data));

    // Anything other than GEARMAN_SUCCESS means the job was not queued
    if ($client->returnCode() !== GEARMAN_SUCCESS) {
        return false;
    }

    return $handle;
}

In this simple example we create an instance of the \GearmanClient class, tell it to connect to the default server (localhost:4730) and send a background task ($client->doBackground()).

Next we ensure that the task was added successfully, and return the job handle.

We might call it with something like this, passing in the username:

$handle = createBackgroundJob('sendWelcomeEmail', ['username' => 'dshafik']);

We would then want to store the handle so that we can later check the status of the task.
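If you are curious what doBackground() actually puts on the wire, here is a rough sketch in Node.js. It assumes the packet layout from the Gearman protocol document as I read it: a 12-byte header (magic, packet type, body size) followed by NUL-separated arguments. The function name buildSubmitJobBg and its parameters are my own illustrative names, not part of any client library.

```javascript
// Sketch of a Gearman SUBMIT_JOB_BG request packet, based on the Gearman
// protocol document. Function and parameter names are illustrative only.
const SUBMIT_JOB_BG = 18; // packet type for a normal-priority background job

function buildSubmitJobBg(functionName, uniqueId, payload) {
    // Body: function name, unique ID and workload, separated by NUL bytes.
    const body = Buffer.concat([
        Buffer.from(functionName, 'utf8'), Buffer.from([0]),
        Buffer.from(uniqueId, 'utf8'), Buffer.from([0]),
        Buffer.from(payload, 'utf8')
    ]);

    // Header: "\0REQ" magic, then type and body size as big-endian 32-bit ints.
    const header = Buffer.alloc(12);
    header.write('\0REQ', 0, 'latin1');
    header.writeUInt32BE(SUBMIT_JOB_BG, 4);
    header.writeUInt32BE(body.length, 8);

    return Buffer.concat([header, body]);
}

// The same job as the PHP example, with an empty unique ID.
const packet = buildSubmitJobBg(
    'sendWelcomeEmail', '', JSON.stringify({ username: 'dshafik' }));
console.log(packet.length + ' bytes ready to send to gearmand');
```

In practice you would never build this by hand, since the client library does it for you, but it shows how little there is to the protocol and why writing a client in a new language is not much work.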

The Worker

Next we’ll create a worker, this time in Ruby:

require 'rubygems'
require 'gearman'
require 'json'

servers = ['localhost:4730']
worker = Gearman::Worker.new(servers)

# Add a handler for the "sendWelcomeEmail" task
worker.add_ability('sendWelcomeEmail') do |data, job|
    data = JSON.parse(data)
    user = User.first(:conditions => [ "username = ?", data["username"] ])
    user.sendWelcomeEmail
end

# Keep pulling jobs from gearmand until the process is stopped
loop { worker.work }

Here we use the gearman-ruby gem to create a Gearman::Worker, and then register the task handler.

In this case, we first decode the JSON data passed in from our GearmanClient and then find our user in the database by the username. We then call the sendWelcomeEmail method.

For something that takes more time, you could send back a running status. The job variable is an instance of the Gearman::Worker::Job class, which allows you to respond using job.report_status(numerator, denominator).

It’s important to note that you can run as many workers for each task as you’d like. Gearman will not hand the same job to multiple workers (however, there is a retry config option should a job fail), and because workers pull jobs rather than having jobs pushed to them, the queue will not overload the workers, though you may run out of free ones. The number of workers you run can also act as a way to manage priority (higher priority jobs get more workers) and balance resources.
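To make the pull model concrete, here is a toy in-memory sketch in Node.js. It is not how gearmand is implemented; ToyQueue and drain are hypothetical names, and the round-robin loop simply stands in for idle workers asking for work. The point is that a job is removed from the queue the moment a worker grabs it, so no two workers ever receive the same job.

```javascript
// Toy model of workers pulling jobs from a shared queue.
// Illustrative only; gearmand itself works over the network.
class ToyQueue {
    constructor() {
        this.jobs = [];
    }

    submit(job) {
        this.jobs.push(job);
    }

    // Called by a worker only when it is idle. shift() removes the job,
    // so it cannot be handed to anyone else.
    grab() {
        return this.jobs.length > 0 ? this.jobs.shift() : null;
    }
}

function drain(queue, workerNames) {
    const handled = {};
    for (const name of workerNames) handled[name] = [];

    // Workers take turns asking for work until the queue is empty.
    for (let turn = 0; ; turn++) {
        const job = queue.grab();
        if (job === null) break;
        handled[workerNames[turn % workerNames.length]].push(job);
    }
    return handled;
}

const queue = new ToyQueue();
['job1', 'job2', 'job3', 'job4', 'job5'].forEach(job => queue.submit(job));
console.log(drain(queue, ['worker-a', 'worker-b']));
```

Adding a third worker name to the list spreads the same five jobs over three workers without touching the queue, which is the scaling knob described above.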

Checking the Status

Finally, we’ll need a way to check the status of the request. For this we’ll use Node.js/JavaScript. In our case we are only looking to see if the job has completed, as we haven’t sent any other status.

var http = require("http"),
    url = require("url"),
    gearman = require("gearman");

var server = http.createServer(function (request, response) {
    var client = gearman.createClient();
    var query = url.parse(request.url, true).query;

    if (!("handle" in query)) {
        response.writeHead(404, {"Content-Type": "text/plain"});
        response.end("Job not found!\n");
    } else {
        var status = { };
        client.getJobStatus(query.handle, function(s) { 
            if (s) {
                status = s;
            }

            response.writeHead(200, {"Content-Type": "application/json"});
            response.end(JSON.stringify(status));
        });
    }
});

server.listen(8000);

This creates an HTTP server on port 8000 that, when passed a handle via GET arguments, will return the status.
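Behind a status lookup like this sits gearmand’s reply to a GET_STATUS request. As a rough sketch, assuming the STATUS_RES layout from the Gearman protocol document (job handle, known flag, running flag, numerator, denominator, all NUL-separated), the reply could be decoded as follows. parseStatusRes and the sample handle are hypothetical; a real client library does this for you, and the object it hands to your callback may be shaped differently.

```javascript
// Sketch of decoding a Gearman STATUS_RES packet (the reply to GET_STATUS),
// following the Gearman protocol document. Names here are illustrative.
const STATUS_RES = 20;

function parseStatusRes(packet) {
    if (packet.length < 12 || packet.toString('latin1', 0, 4) !== '\0RES') {
        throw new Error('not a Gearman response packet');
    }
    if (packet.readUInt32BE(4) !== STATUS_RES) {
        throw new Error('not a STATUS_RES packet');
    }

    // Body: handle, known, running, numerator, denominator (NUL-separated).
    const size = packet.readUInt32BE(8);
    const [handle, known, running, numerator, denominator] =
        packet.toString('utf8', 12, 12 + size).split('\0');

    return {
        handle,
        known: known === '1',
        running: running === '1',
        numerator: Number(numerator),
        denominator: Number(denominator)
    };
}

// Build a sample reply by hand; "H:example:1" is a made-up job handle.
const sampleBody = Buffer.from(['H:example:1', '1', '1', '3', '10'].join('\0'));
const sampleHeader = Buffer.alloc(12);
sampleHeader.write('\0RES', 0, 'latin1');
sampleHeader.writeUInt32BE(STATUS_RES, 4);
sampleHeader.writeUInt32BE(sampleBody.length, 8);
console.log(parseStatusRes(Buffer.concat([sampleHeader, sampleBody])));
```

As far as I understand the protocol, a handle that gearmand reports as not known belongs to a job that has either finished or never existed, which is why a completion check works even though our worker never reports progress.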

Using Gearman with Engine Yard Cloud

In order to make Gearman a part of the background job processes on your Engine Yard Cloud account, it is necessary to create a custom Chef recipe to compile it yourself (Chef recipes can be used to take advantage of software outside of the current stack). For more details on using Chef with Engine Yard Cloud, check out our knowledge base.

As with all background jobs, best practices recommend that Gearman be run on a Utility Instance, so that all jobs are processed without interfering with the Application Instances themselves.

Can’t we all just get along?

So, as you can see, Gearman can act like glue between the various parts of your application. It’s super fast, has low resource usage and can be used with almost any language you can think of.

Additionally, it can not only do foreground tasks (with communication), but can also prioritize jobs into high/standard/low priority queues.
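The priority behavior can be pictured with another toy model in Node.js. This is a sketch of the idea only, assuming (as I understand it) that gearmand serves waiting high priority jobs before normal ones, and normal before low; ToyPriorityQueue is a hypothetical name. On the PHP side, pecl/gearman exposes this through methods such as doHighBackground() and doLowBackground().

```javascript
// Toy model of three priority levels feeding one set of workers.
// Illustrative only; not gearmand's actual implementation.
const LEVELS = ['high', 'normal', 'low'];

class ToyPriorityQueue {
    constructor() {
        this.queues = { high: [], normal: [], low: [] };
    }

    submit(job, priority = 'normal') {
        if (!LEVELS.includes(priority)) {
            throw new Error('unknown priority: ' + priority);
        }
        this.queues[priority].push(job);
    }

    // Return the oldest job from the highest non-empty level, or null.
    grab() {
        for (const level of LEVELS) {
            if (this.queues[level].length > 0) {
                return this.queues[level].shift();
            }
        }
        return null;
    }
}

const jobs = new ToyPriorityQueue();
jobs.submit('cleanup', 'low');
jobs.submit('welcome-email');
jobs.submit('password-reset', 'high');

let next;
while ((next = jobs.grab()) !== null) {
    console.log(next);
}
```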

You can also easily scale Gearman, as the clients and workers both support multiple servers, allowing you to spread your queue and your workers out over multiple machines.
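One way to picture a client using multiple servers is simple failover: try each job server in turn until one accepts the job. The sketch below is an assumption about the general approach, not a description of what any particular client library does; submitWithFailover, the trySubmit callback and the hostnames are all hypothetical.

```javascript
// Sketch of client-side failover across several job servers.
// trySubmit(server) is a hypothetical callback that returns a job handle
// or throws if that server cannot be reached.
function submitWithFailover(servers, trySubmit) {
    const errors = [];
    for (const server of servers) {
        try {
            // Keep the server alongside the handle for later status checks.
            return { server, handle: trySubmit(server) };
        } catch (err) {
            errors.push(server + ': ' + err.message);
        }
    }
    throw new Error('no job server accepted the job (' + errors.join('; ') + ')');
}

// Example with a stub in place of a real network call.
const result = submitWithFailover(
    ['queue1.example.com:4730', 'queue2.example.com:4730'],
    server => {
        if (server.startsWith('queue1')) throw new Error('connection refused');
        return 'H:queue2:1'; // made-up job handle
    });
console.log(result);
```

Returning the server together with the handle matters because a job handle only means something to the gearmand instance that issued it, so a later status check has to go back to that same server.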

I highly recommend checking it out at http://gearman.org.

About Davey Shafik

Davey Shafik is a full-time PHP developer with 12 years of experience in PHP and related technologies. A Community Engineer for Engine Yard, he has written three books (so far!), numerous articles and spoken at conferences the globe over.

Davey is best known for his books, the Zend PHP 5 Certification Study Guide and PHP Master: Write Cutting Edge Code, and as the originator of PHP Archive (PHAR) for PHP 5.3.